perm filename V242.TEX[TEX,DEK]9 blob
sn#427780 filedate 1979-03-27 generic text, type C, neo UTF8
\input acphdr % Section 4.2
\runninglefthead{ARITHMETIC} % chapter title
\titlepage\setcount00
\null
\vfill
\tenpoint
\ctrline{SECTION 4.2 of THE ART OF COMPUTER PROGRAMMING}
\ctrline{$\copyright$ 1978 Addison--Wesley Publishing Company, Inc.}
\vfill
\runningrighthead{FLOATING-POINT ARITHMETIC}
\section{4.2}
\eject
\setcount0 196
%folio 259 galley 7b (C) Addison-Wesley 1978 *
\sectionbegin{4.2. FLOATING-POINT ARITHMETIC}
I{\:cN THIS SECTION}, we shall study the basic principles
of doing arithmetic on ``floating-point'' numbers, by analyzing
the internal mechanisms underlying such calculations. Perhaps many readers
will have little interest in this subject, since their computers
either have built-in floating-point instructions or their computer
manufacturer has supplied suitable subroutines. But, in fact,
the material of this section should not merely be the concern
of computer-design engineers or of a small clique of people
who write library subroutines for new machines; {\sl every}
well-rounded programmer ought to have a knowledge of what goes
on during the elementary steps of floating-point arithmetic.
This subject is not at all as trivial as most people think;
it involves a surprising amount of interesting information.
\runningrighthead{SINGLE-PRECISION CALCULATIONS}
\section{4.2.1}
\sectionskip
\sectionbegin{4.2.1. Single-Precision Calculations}
\def\\{\raise 3.444pt\hbox{$\scriptscriptstyle\!\!\not\,\,$}} % \\h will be h bar
{\bf A. Floating-point notation.}\xskip We have
discussed ``fixed-point'' notation for numbers in Section 4.1;
in such a case the programmer knows where the radix point is
assumed to lie in the numbers he manipulates. For many purposes
it is considerably more convenient to let the position of the
radix point be dynamically variable or ``floating'' as a program
is running, and to carry with each number an indication of its
current radix point position. This idea has been used
for many years in scientific calculations, especially for expressing
very large numbers like Avogadro's number $N = 6.02252 \times
10↑{23}$, or very small numbers like Planck's constant $\\h = 1.0545
\times 10↑{-27}$ erg sec.
In this section we shall work with {\sl base $b$, excess $q$, floating-point
numbers with $p$ digits}\/: Such numbers will be represented by pairs of values
$(e, f)$, denoting
$$(e, f) = f \times b↑{e-q}.\eqno (1)$$
Here $e$ is an integer having a specified range,
and $f$ is a signed fraction. We will adopt the convention that
$$|f| < 1;$$
in other words, the radix point appears at the
left of the positional representation of $f$. More precisely,
the stipulation that we have $p$-digit numbers means that $b↑pf$
is an integer, and that
$$-b↑p < b↑pf < b↑p.\eqno (2)$$
The term ``floating binary'' implies that $b =
2$, ``floating decimal'' implies $b = 10$, etc. Using excess
50 floating-decimal numbers with 8 digits, we can write, for
example,
$$\vcenter{\halign{\lft{#}\quad⊗\rt{$#$}⊗\lft{$\;#$}\cr
Avogadro's number⊗N⊗ = (74, +.60225200);\cr
\noalign{\vskip 2pt}
Planck's constant⊗\\h⊗= (24, +.10545000).\cr}}\eqno(3)$$
The two components $e$ and $f$ of a floating-point
number are called the {\sl exponent} and the {\sl fraction}
parts, respectively.\xskip (Other names are occasionally used for
this purpose, notably ``characteristic'' and ``mantissa''; but
it is an abuse of terminology to call the fraction part a mantissa,
since this concept has quite a different meaning in connection
with logarithms. Furthermore the English word mantissa means
``a worthless addition.'')
The \MIX\ computer assumes that its floating-point numbers
have the form
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\vcenter{\hbox to 124pt{\hfill\vbox{\baselineskip0pt\lineskip0pt
\hrule
\hbox{\\ \hbox to 20pt{\hfill$\pm$\hfill\\}\!
\hbox to 20pt{\hfill$e$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}}
\hrule}\hfill}}.\eqno(4)$$
Here we have base $b$, excess $q$, floating-point
notation with four bytes of precision, where $b$ is the
byte size (e.g., $b = 64$ or $b = 100)$, and $q$ is equal to
$\lfloor {1\over 2}b\rfloor $. The fraction part is $\pm\,f\,f\,f\,f$,
and $e$ is the exponent, which lies in the range $0 ≤
e < b$. This internal representation is typical of the conventions
in most existing computers, although $b$ is a much larger base
than usual.
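
To make these conventions concrete, here is a small Python sketch (an illustrative aside, not part of the text; the function name and default parameters are ours) that evaluates an excess-$q$ pair $(e,f)$ according to (1), using exact rational arithmetic:

from fractions import Fraction

def value(e, f, b=100, q=50):
    """Value of the base-b, excess-q floating-point pair (e, f), as in (1)."""
    return f * Fraction(b) ** (e - q)

# Avogadro's number in the excess-50 decimal form of (3): (74, +.60225200).
print(float(value(74, Fraction(60225200, 10**8), b=10, q=50)))   # about 6.02252e+23
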
\subsectionbegin{B. Normalized calculations} A floating-point
number $(e, f)$ is said to be {\sl normalized} if the most significant
digit of the representation of $f$ is nonzero, so that
$$1/b ≤ |f| < 1;\eqno (5)$$
or if $f = 0$ and $e$ has its smallest possible
value. It is possible to tell which of two normalized floating-point
numbers has a greater magnitude by comparing the exponent parts
first, and then testing the fraction parts only if the exponents
are equal.
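
The comparison rule just stated amounts to a lexicographic test on the pair (exponent, fraction magnitude); in Python it is a one-liner (a sketch only; the name is ours, and both operands are assumed to be nonzero and normalized):

def greater_magnitude(u, v):
    """True if |u| > |v|, for normalized pairs u = (eu, fu) and v = (ev, fv):
    compare exponents first, and fractions only when the exponents are equal."""
    (eu, fu), (ev, fv) = u, v
    return (eu, abs(fu)) > (ev, abs(fv))
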
%folio 262 galley 1 (C) Addison-Wesley 1978 *
Most floating-point routines now in use deal almost
entirely with normalized numbers: inputs to the routines are
assumed to be normalized, and the outputs are always normalized.
Under these conventions we lose the ability to represent a few
numbers of very small magnitude---for example, the value $(0,
.00000001)$ can't be normalized without producing a negative
exponent---but we gain in speed, uniformity, and the ability
to give relatively simple bounds on the relative error in our
computations.\xskip (Unnormalized floating-point arithmetic is discussed
in Section 4.2.2.)
Let us now study the normalized floating-point
operations in detail. At the same time we can consider the construction
of subroutines for these operations, assuming that we have
a computer without built-in floating-point hardware.
Machine-language subroutines for floating-point
arithmetic are usually written in a very machine-dependent manner,
using many of the wildest idiosyncrasies of the computer at
hand; so floating-point addition subroutines
for two different machines usually bear little superficial resemblance
to each other. Yet a careful study of numerous subroutines
for both binary and decimal computers reveals that these programs
actually have quite a lot in common, and it is possible to discuss
the topics in a machine-independent way.
\yskip The first (and by far the most difficult!) algorithm we shall
discuss in this section is a procedure for floating-point addition,
$$(e↓u, f↓u) \oplus (e↓v, f↓v) = (e↓w, f↓w).\eqno (6)$$
{\sl Note: Since floating-point arithmetic is
inherently approximate, not exact, we will use ``round'' symbols
$$\oplus ,\quad \ominus ,\quad \otimes ,\quad \odiv$$
to stand for floating-point addition, subtraction,
multiplication, and division, respectively, in order to distinguish
approximate operations from the true ones.}
\yskip The basic idea involved in floating-point
addition is fairly simple: Assuming that $e↓u ≥ e↓v$, we take
$e↓w = e↓u$, $f↓w = f↓u + f↓v/b↑{e↓u-e↓v}$ (thereby aligning the
radix points for a meaningful addition), and normalize the result.
Several situations can arise that make this process nontrivial,
and the following algorithm explains the method more precisely.
\algbegin Algorithm A (Floating-point addition).
Given base $b$, excess $q$, $p$-digit, normalized floating-point
numbers $u = (e↓u, f↓u)$ and $v = (e↓v, f↓v)$, this algorithm
forms the sum $w = u \oplus v$. The same procedure may be used
for floating-point subtraction, if $-v$ is substituted for $v$.
\algstep A1. [Unpack.] Separate the
exponent and fraction parts of the representations
of $u$ and $v$.
\topinsert{\vskip 60mm
\ctrline{\caption Fig.\ 2. Floating-point addition.}}
\algstep A2. [Assume $e↓u ≥ e↓v$.] If $e↓u < e↓v$, interchange
$u$ and $v$.\xskip (In many cases, it is best to combine step A2 with
step A1 or with some of the later steps.)
\algstep A3. [Set $e↓w$.] Set $e↓w ← e↓u$.
\algstep A4. [Test $e↓u - e↓v$.] If $e↓u - e↓v ≥ p + 2$ (large
difference in exponents), set $f↓w ← f↓u$ and go to step A7.\xskip
(Actually, since we are assuming that $u$ is normalized, we
could terminate the algorithm; but it is often useful to be
able to normalize a possibly unnormalized number by adding zero
to it.)
\algstep A5. [Scale right.] Shift $f↓v$ to the right $e↓u -
e↓v$ places; i.e., divide it by $b↑{e↓u-e↓v}$.\xskip [{\sl Note:} This
will be a shift of up to $p + 1$ places, and the next step (which
adds $f↓u$ to $f↓v$) thereby requires an accumulator capable
of holding $2p + 1$ base-$b$ digits to the right of the radix
point. If such a large accumulator is not available, it is possible
to shorten the requirement to $p + 2$ or $p + 3$ places if proper
precautions are taken; the details are given in exercise 5.]
\algstep A6. [Add.] Set $f↓w ← f↓u + f↓v$.
\algstep A7. [Normalize.] (At this point $(e↓w, f↓w)$ represents
the sum of $u$ and $v$, but $|f↓w|$ may have more than $p$ digits,
and it may be greater than unity or less than $1/b$.)\xskip Perform
Algorithm N below, to normalize and round $(e↓w, f↓w)$ into
the final answer.\quad\blackslug
\algbegin Algorithm N (Normalization).
A ``raw exponent'' $e$ and a ``raw fraction'' $f$ are converted
to normalized form, rounding if necessary to $p$ digits. This
algorithm assumes that $|f| < b$.
\algstep N1. [Test $f$.] If $|f|≥1$ (``fraction
overflow''), go to step N4. If $f = 0$, set $e$ to its lowest
possible value and go to step N7.
\topinsert{\vskip 56mm
\ctrline{\caption Fig.\ 3. Normalization of $(e,f)$.}}
\algstep N2. [Is $f$ normalized?] If $|f| ≥ 1/b$, go to step
N5.
\algstep N3. [Scale left.] Shift $f$ to the left by one digit
position (i.e., multiply it by $b$), and decrease $e$ by 1.
Return to step N2.
\algstep N4. [Scale right.] Shift $f$ to the right by one digit
position (i.e., divide it by $b$), and increase $e$ by 1.
\algstep N5. [Round.] Round $f$ to $p$ places.\xskip (We take this
to mean that $f$ is changed to the nearest multiple of $b↑{-p}$.
It is possible that $(b↑pf)\mod 1 = {1\over 2}$ so that there
are {\sl two} nearest multiples; if $b$ is even, we choose the
one that makes $b↑pf + {1\over 2}b$ odd. Further discussion
of rounding appears in Section 4.2.2.) It is important to note
that this rounding operation can make $|f| = 1$ (``rounding
overflow''); in such a case, return to step N4.
\algstep N6. [Check $e$.] If $e$ is too large, i.e., larger
than its allowed range, an {\sl exponent overflow} condition
is sensed. If $e$ is too small, an {\sl exponent underflow}
condition is sensed.\xskip (See the discussion below; since the result
cannot be expressed as a normalized floating-point number in
the required range, special action is necessary.)
\algstep N7. [Pack.] Put $e$ and $f$ together into the desired
output representation.\quad\blackslug
\yyskip Some simple examples of floating-point addition are given in exercise 4.
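
Readers who want to experiment with Algorithms A and N may find the following Python sketch useful. It is an illustration only, not \MIX\ code: the names normalize and fadd are ours, and exact rational fractions are used in place of a fixed-width accumulator, so the register-length economies of step A5 and exercise 5 are not needed here.

from fractions import Fraction

def normalize(e, f, b, p, e_min, e_max):
    """A sketch of Algorithm N: normalize and round the raw pair (e, f).

    f is a Fraction with |f| < b; on exit 1/b <= |f| < 1 (or f == 0) and
    b**p * f is an integer, so f has p base-b digits."""
    if f == 0:
        return (e_min, Fraction(0))                  # N1: zero gets the lowest exponent
    if abs(f) >= 1:                                  # N1, N4: fraction overflow, scale right
        f /= b
        e += 1
    while abs(f) < Fraction(1, b):                   # N2, N3: scale left until normalized
        f *= b
        e -= 1
    scaled = f * b**p                                # N5: round to a multiple of b**(-p)
    lo = scaled.numerator // scaled.denominator      # floor of the scaled fraction
    hi = lo + 1
    if scaled - lo < Fraction(1, 2):
        n = lo
    elif scaled - lo > Fraction(1, 2):
        n = hi
    else:
        n = lo if (lo + b // 2) % 2 == 1 else hi     # tie: make b**p * f + b/2 odd
    f = Fraction(n, b**p)
    if abs(f) >= 1:                                  # rounding overflow: back to N4 once
        f /= b
        e += 1
    if not e_min <= e <= e_max:                      # N6: exponent underflow or overflow
        raise OverflowError("exponent out of range")
    return (e, f)                                    # N7: pack

def fadd(u, v, b, p, e_min, e_max):
    """A sketch of Algorithm A: add normalized pairs u = (eu, fu), v = (ev, fv)."""
    (eu, fu), (ev, fv) = u, v
    if eu < ev:                                      # A2: ensure eu >= ev
        (eu, fu), (ev, fv) = (ev, fv), (eu, fu)
    ew = eu                                          # A3
    if eu - ev >= p + 2:                             # A4: exponents differ too much
        fw = fu
    else:
        fw = fu + fv / Fraction(b) ** (eu - ev)      # A5, A6: align radix points and add
    return normalize(ew, fw, b, p, e_min, e_max)     # A7: normalize and round

# For example, with b = 10, p = 8 and exponents restricted to 0 <= e <= 99:
#   fadd((51, Fraction(12345678, 10**8)), (49, Fraction(87654321, 10**8)), 10, 8, 0, 99)
# yields (51, Fraction(13222221, 10**8)), i.e., (51, +.13222221).
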
%folio 265 galley 2 (C) Addison-Wesley 1978 *
\yyskip The following \MIX\ subroutines, for addition and subtraction
of numbers having the form (4), show how Algorithms A and N
can be expressed as computer programs. The subroutines below
are designed to take one input $u$ from symbolic location \.{ACC},
and the other input $v$ comes from register A upon entrance
to the subroutine. The output $w$ appears both in register A
and location \.{ACC}. Thus, a fixed-point coding sequence
$$\.{LDA A;\quad ADD B;\quad SUB C;\quad STA D}\eqno (7)$$
would correspond to the floating-point coding sequence
$$\.{LDA A, STA ACC;\quad LDA B, JMP FADD;\quad
LDA C, JMP FSUB;\quad STA D}.\eqno (8)$$
\vskip -6pt
\algbegin Program A (Addition, subtraction,
and normalization). The following program is a subroutine
for Algorithm A\null, and it is also designed so that the normalization
portion can be used by other subroutines that appear later
in this section. In this program and in many others
throughout this chapter, \.{OFLO} stands for a subroutine that
prints out a message to the effect that \MIX's overflow toggle
was unexpectedly found to be ``on.'' The byte size $b$ is assumed
to be a multiple of 4. The normalization routine \.{NORM} assumes
that $\rI2 = e$ and $\rAX = f$, where $\rA = 0$ implies $\rX = 0$ and
$\rI2 < b$.
\yyskip
{\mixfour{00⊗BYTE⊗EQU⊗1(4:4)⊗Byte size $b$\cr
01⊗EXP⊗EQU⊗1:1⊗Definition of exponent field\cr
02⊗FSUB⊗STA⊗TEMP⊗Floating-point subtraction subroutine:\cr
03⊗⊗LDAN⊗TEMP⊗Change sign of operand.\cr
04⊗FADD⊗STJ⊗EXITF⊗Floating-point addition subroutine:\cr
05⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
06⊗⊗STA⊗TEMP⊗$\.{TEMP}← v$.\cr
07⊗⊗LDX⊗ACC⊗$\rX ← u$.\cr
\\
08⊗⊗CMPA⊗ACC(EXP)⊗$\underline{\hbox{Ste}}$p\hskip-3pt$\underline{\hbox{\hskip3pt
s A1}}$,\hskip-1pt$\underline
{\hbox{\hskip1pt\ A2}}$,\hskip-1pt$\underline{\hbox{\hskip1pt\
A3 are combined here:}}$\cr
09⊗⊗JGE⊗1F⊗Jump if $e↓v ≥ e↓u$.\cr
10⊗⊗STX⊗FU(0:4)⊗$\.{FU} ← \pm\,f\,f\,f\,f\,0$.\cr
11⊗⊗LD2⊗ACC(EXP)⊗$\rI2 ← e↓w$.\cr
12⊗⊗STA⊗FV(0:4)⊗\cr
13⊗⊗LD1N⊗TEMP(EXP)⊗$\rI1 ← -e↓v$.\cr
14⊗⊗JMP⊗4F\cr
\\
15⊗1H⊗STA⊗FU(0:4)⊗$\.{FU} ← \pm\,f\,f\,f\,f\,0$ ($u, v$ interchanged).\cr
16⊗⊗LD2⊗TEMP(EXP)⊗$\rI2 ← e↓w$.\cr
17⊗⊗STX⊗FV(0:4)\cr
18⊗⊗LD1N⊗ACC(EXP)⊗$\rI1 ← -e↓v$.\cr
\noalign{\penalty-500} % very good place to break
19⊗4H⊗INC1⊗0,2⊗$\rI1 ← e↓u - e↓v$.\xskip(Step A4 unnecessary.)\cr
\\
20⊗5H⊗LDA⊗FV⊗\understep{A5. Scale ri}\sl g\understep{ht.}\cr
21⊗⊗ENTX⊗0⊗Clear rX.\cr
22⊗⊗SRAX⊗0,1⊗Shift right $e↓u - e↓v$ places.\cr
\\
23⊗6H⊗ADD⊗FU⊗\understep{A6. Add.}\cr
\\
24⊗⊗JOV⊗N4⊗\understep{A7. Normalize.} Jump if fraction overflow.\cr
25⊗⊗JXZ⊗NORM⊗Easy case?\cr
\\
26⊗⊗CMPA⊗=0=(1:1)⊗Is $f$ normalized?\cr
27⊗⊗JNE⊗N5⊗If so, round it.\cr
28⊗⊗SRC⊗5⊗$|\rX|↔|\rA|$.\cr
29⊗⊗DECX⊗1⊗(rX is positive.)\cr
30⊗⊗STA⊗TEMP⊗(Operands had opposite signs,\cr
31⊗⊗STA⊗HALF(0:0)⊗\quad registers must be adjusted\cr
32⊗⊗LDAN⊗TEMP⊗\quad before rounding and normalization.)\cr
33⊗⊗ADD⊗HALF\cr
34⊗⊗ADD⊗HALF⊗Complement least significant half.\cr
35⊗⊗SRC⊗4⊗Jump into normalization routine.\cr
36⊗⊗JMP⊗N3A\cr
\\
37⊗HALF⊗CON⊗1//2⊗One half the word size (Sign varies)\cr
38⊗FU⊗CON⊗0⊗Fraction part $f↓u$\cr
39⊗FV⊗CON⊗0⊗Fraction part $f↓v$\cr
\noalign{\penalty-300\vskip 3pt plus 6pt minus 2pt}
40⊗NORM⊗JAZ⊗ZRO⊗\understep{N1. Test }$f$\hskip-2pt\understep{\hskip2pt.}\cr
41⊗N2⊗CMPA⊗=0=(1:1)⊗\understep{N2. Is }$f$\hskip-2pt\understep{\hskip2pt\
normalized?}\cr
42⊗⊗JNE⊗N5⊗To N5 if leading byte nonzero.\cr
\\
43⊗N3⊗SLAX⊗1⊗\understep{N3. Scale left.}\cr
44⊗N3A⊗DEC2⊗1⊗Decrease $e$ by 1.\cr
45⊗⊗JMP⊗N2⊗Return to N2.\cr
\\
46⊗N4⊗ENTX⊗1⊗\understep{N4. Scale ri}\sl g\understep{ht.}\cr
47⊗⊗SRC⊗1⊗Shift right, insert ``1'' with proper sign.\cr
48⊗⊗INC2⊗1⊗Increase $e$ by 1.\cr
\\
49⊗N5⊗CMPA⊗=BYTE/2=(5:5)⊗\understep{N5. Round.}\cr
50⊗⊗JL⊗N6⊗Is $|$tail$| < {1\over 2}b$?\cr
51⊗⊗JG⊗5F\cr
52⊗⊗JXNZ⊗5F⊗Is $|$tail$| > {1\over 2}b$?\cr
53⊗⊗STA⊗TEMP⊗$|$tail$| = {1\over 2}b$; round to odd.\cr
54⊗⊗LDX⊗TEMP(4:4)\cr
55⊗⊗JXO⊗N6⊗To N6 if rX is odd.\cr
\\
56⊗5H⊗STA⊗*+1(0:0)⊗Store sign of rA.\cr
57⊗⊗INCA⊗BYTE⊗Add $b↑{-4}$ to $|f|$.\xskip (Sign varies)\cr
58⊗⊗JOV⊗N4⊗Check for rounding overflow.\cr
\\
59⊗N6⊗J2N⊗EXPUN⊗\understep{N6. Check $e$.} Underflow if $e < 0$.\cr
\\
60⊗N7⊗ENTX⊗0,2⊗\understep{N7. Pack.} $\rX ← e$.\cr
61⊗⊗SRC⊗1\cr
62⊗ZRO⊗DEC2⊗BYTE⊗$\rI2 ← e - b$.\cr
\\
63⊗8H⊗STA⊗ACC\cr
64⊗EXITF⊗J2N⊗*⊗Exit, unless $e ≥ b$.\cr
65⊗EXPOV⊗HLT⊗2⊗Exponent overflow detected\cr
66⊗EXPUN⊗HLT⊗1⊗Exponent underflow detected\cr
67⊗ACC⊗CON⊗0⊗Floating-point accumulator\quad\blackslug\cr}}
%folio 269 galley 3 (C) Addison-Wesley 1978 *
\yyskip The rather long section of code from lines 25 to 37
is needed because \MIX\ has only a 5-byte accumulator for adding
signed numbers while in general $2p + 1 = 9$ places of accuracy
are required by Algorithm A\null. The program could be shortened
to about half its present length if we were willing to sacrifice
a little bit of its accuracy, but we shall see in the next section
that full accuracy is important. Line 55 uses a nonstandard
\MIX\ instruction defined in Section 4.5.2. The running time for
floating-point addition and subtraction depends on several factors
that are analyzed in Section 4.2.4.
Now let us consider multiplication
and division, which are simpler than addition, and which are
somewhat similar to each other.
\algbegin Algorithm M (Floating-point multiplication
or division). Given base $b$, excess $q$,\penalty-300\ $p$-digit, normalized
floating-point numbers $u = (e↓u, f↓u)$ and $v = (e↓v, f↓v)$,
this algorithm forms the product $w = u \otimes v$ or the quotient
$w = u \odiv v$.
\algstep M1. [Unpack.] Separate the exponent
and fraction parts of the representations of $u$ and $v$.\xskip (Sometimes
it is convenient, but not necessary, to test the operands for
zero during this step.)
\algstep M2. [Operate.] Set
$$\vcenter{\halign{\lft{$#$}\quad⊗\lft{$#$}\quad⊗\lft{#}\cr
e↓w ← e↓u + e↓v - q,⊗f↓w ← f↓u\,f↓v⊗for multiplication;\cr
\noalign{\vskip 3pt}
e↓w ← e↓u - e↓v + q + 1,⊗f↓w ← (b↑{-1}f↓u)/f↓v⊗for division.\cr}}\eqno(9)$$
(Since the input numbers are assumed to
be normalized, it follows that either $f↓w = 0$, or $1/b↑2 ≤
|f↓w| < 1$, or a division-by-zero error has occurred.)\xskip If necessary,
the representation of $f↓w$ may be reduced to $p+2$ or $p+3$
digits at this point, as in exercise 5.
\algstep M3. [Normalize.] Perform Algorithm N on
$(e↓w, f↓w)$ to normalize, round, and pack the result.\xskip [{\sl
Note:} Normalization is simpler in this case, since scaling
left occurs at most once, and rounding overflow cannot occur
after division.]\quad\blackslug
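
In terms of the Python sketch given after Algorithm N above, step M2 is just an exponent adjustment followed by a call on normalize (again an illustration only; since the fractions here are exact, the scaling provision of lines 23--26 of the division program below is not needed):

def fmul(u, v, b, p, q, e_min, e_max):
    """A sketch of Algorithm M, multiplication: e_w = e_u + e_v - q, f_w = f_u * f_v."""
    (eu, fu), (ev, fv) = u, v
    return normalize(eu + ev - q, fu * fv, b, p, e_min, e_max)

def fdiv(u, v, b, p, q, e_min, e_max):
    """A sketch of Algorithm M, division: e_w = e_u - e_v + q + 1, f_w = (f_u/b) / f_v."""
    (eu, fu), (ev, fv) = u, v
    if fv == 0:
        raise ZeroDivisionError("division by a zero divisor")
    return normalize(eu - ev + q + 1, (fu / b) / fv, b, p, e_min, e_max)
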
\yyskip The following \MIX\ subroutines,
which are intended to be used in connection with Program A\null, illustrate
the machine considerations necessary in connection with Algorithm
M.
\algbegin Program M (Floating-point multiplication and division).
\yyskip
{\mixfour{01⊗Q⊗EQU⊗BYTE/2⊗$q$ is half the byte size\cr
02⊗FMUL⊗STJ⊗EXITF⊗Floating-point multiplication subroutine:\cr
03⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
04⊗⊗STA⊗TEMP⊗$\.{TEMP} ← v$.\cr
05⊗⊗LDX⊗ACC⊗$\rX ← u$.\cr
06⊗⊗STX⊗FU(0:4)⊗$\.{FU} ← \pm\,f\,f\,f\,f\,0$.\cr
\\
07⊗⊗LD1⊗TEMP(EXP)\cr
08⊗⊗LD2⊗ACC(EXP)\cr
09⊗⊗INC2⊗-Q,1⊗$\rI2 ← e↓u + e↓v - q$.\cr
\\
10⊗⊗SLA⊗1\cr
11⊗⊗MUL⊗FU⊗Multiply $f↓u$ times $f↓v$.\cr
12⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\cr
\noalign{\penalty-300\vskip 3pt plus 6pt minus 2pt}
13⊗FDIV⊗STJ⊗EXITF⊗Floating-point division subroutine:\cr
14⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
15⊗⊗STA⊗TEMP⊗$\.{TEMP} ← v$.\cr
16⊗⊗STA⊗FV(0:4)⊗$\.{FV} ← \pm\,f\,f\,f\,f\,0$.\cr
\\
17⊗⊗LD1⊗TEMP(EXP)\cr
18⊗⊗LD2⊗ACC(EXP)\cr
19⊗⊗DEC2⊗-Q,1⊗$\rI2 ← e↓u - e↓v + q$.\cr
\\
20⊗⊗ENTX⊗0\cr
21⊗⊗LDA⊗ACC\cr
22⊗⊗SLA⊗1⊗$\rA ← f↓u$.\cr
23⊗⊗CMPA⊗FV(1:5)\cr
24⊗⊗JL⊗*+3⊗Jump if $|f↓u| < |f↓v|$.\cr
25⊗⊗SRA⊗1⊗Otherwise, scale $f↓u$ right\cr
26⊗⊗INC2⊗1⊗\qquad and increase rI2 by 1.\cr
\\
27⊗⊗DIV⊗FV⊗Divide.\cr
28⊗⊗JNOV⊗NORM⊗Normalize, round, and exit.\cr
29⊗DVZRO⊗HLT⊗3⊗Unnormalized or zero divisor.\quad\blackslug\cr}}
\yyskip The most noteworthy feature of this program
is the provision for division in lines 23--26, which is made
in order to ensure enough accuracy to round the answer. If $|f↓u|
< |f↓v|$, straightforward application of Algorithm M would leave
a result of the form ``$\pm\,0\,f\,f\,f\,f$'' in register A, and
this would not allow a proper rounding without a careful analysis
of the remainder (which appears in register X). So the program
computes $f↓w ← f↓u/f↓v$ in this case, ensuring that $f↓w$ is
either zero or normalized in all cases; rounding can proceed
with five significant bytes, possibly testing whether the remainder
is zero.
\yskip We occasionally need to convert values between fixed-
and floating-point representations. A ``fix-to-float'' routine
is easily obtained with the help of the normalization algorithm
above; for example, in \MIX, the following subroutine converts
an integer to floating-point form:
$$\vcenter{\mixfour{01⊗FLOT⊗STJ⊗EXITF⊗Assume that $\rA=u$, an integer.\cr
02⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
03⊗⊗ENT2⊗Q+5⊗Set raw exponent.\cr
04⊗⊗ENTX⊗0\cr
05⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\quad\blackslug\cr}}\eqno(10)$$
A ``float-to-fix'' subroutine is the subject of exercise 14.
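
In terms of the same Python sketch, a fix-to-float conversion in the spirit of (10) simply hands the integer to the normalization routine with an appropriate raw exponent (an illustration only; the name flot is ours, and the integer is assumed to have at most p+1 base-b digits so that the raw fraction satisfies |f| < b):

def flot(n, b, p, q, e_min, e_max):
    """A sketch of fix-to-float conversion: (q + p, n / b**p) has the value n,
    and normalize scales it left and rounds as needed."""
    return normalize(q + p, Fraction(n, b**p), b, p, e_min, e_max)

# For example, flot(3, 10, 8, 50, 0, 99) == (51, Fraction(3, 10)), i.e., +.30000000 x 10**1.
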
%folio 273 galley 4 (C) Addison-Wesley 1978 *
\yskip The debugging of floating-point subroutines is usually
a difficult job, since there are so many cases to consider.
Here is a list of common pitfalls that often trap a programmer
or machine designer who is preparing floating-point routines:
\yskip\textindent{1)}{\sl Losing the sign.}\xskip On many machines (not \MIX),
shift instructions between registers will affect the sign, and
the shifting operations used in normalizing and scaling numbers
must be carefully analyzed. The sign is also lost frequently
when minus zero is present.\xskip (For example, Program A is careful
to retain the sign of register A in lines 30--34. See also exercise
6.)
\yskip\textindent{2)}{\sl Failure to treat exponent underflow or
overflow properly.}\xskip The size of $e↓w$ should not be checked
until {\sl after} the rounding and normalization, because preliminary
tests may give an erroneous indication. Exponent underflow and
overflow can occur on floating-point addition and subtraction,
not only during multiplication and division; and even though
this is a rather rare occurrence, it must be tested each time.
Enough information should be retained that meaningful corrective
actions are possible after overflow or underflow has occurred.
It has unfortunately become customary in many instances
to ignore exponent underflow and simply to set underflowed results
to zero with no indication of error. This causes a serious loss
of accuracy in most cases (indeed, it is the loss of {\sl all}
the significant digits), and the assumptions underlying floating-point
arithmetic have broken down, so the programmer really must be
told when underflow has occurred. Setting the result to zero
is appropriate only in certain cases when the result is later to be
added to a significantly larger quantity. When exponent underflow
is not detected, we find mysterious situations in which $(u
\otimes v) \otimes w$ is zero, but $u \otimes (v \otimes w)$
is not, since $u \otimes v$ results in exponent underflow but
$u \otimes (v \otimes w)$ can be calculated without any exponents
falling out of range. Similarly, we can find positive numbers
$a$, $b$, $c$, $d$, and $y$ such that
$$\eqalign{(a \otimes y \;\oplus\; b) \odiv (c \otimes y
\;\oplus\; d) ⊗\; \approx\;\textstyle {2\over 3},\cr
\noalign{\vskip 3pt}
(a \;\oplus\; b \odiv y) \odiv (c \;\oplus\;d \odiv y) ⊗\;=\; 1\cr}\eqno(11)$$
if exponent underflow is not detected.\xskip (See exercise
9.)\xskip Even though floating-point routines are not precisely accurate,
such a disparity as (11) is certainly unexpected when $a$, $b$, $c$,
$d$, and $y$ are all {\sl positive}\/! Exponent underflow is usually
not anticipated by a programmer, so he needs to be told about
it.
\yskip\textindent{3)}{\sl Inserted garbage.}\xskip When scaling to the
left it is important to keep from introducing anything but zeros
at the right. For example, note the ``\.{ENTX 0}'' instruction in
line 21 of Program A\null, and the all-too-easily-forgotten ``\.{ENTX
0}'' instruction in line 04 of the \.{FLOT} subroutine (10).\xskip (But
it would be a mistake to clear register X after line 27 in the
division subroutine.)
\yskip\textindent{4)}{\sl Unforeseen rounding overflow.}\xskip When a number
like .999999997 is rounded to 8 digits, a carry will occur to
the left of the decimal point, and the result must be scaled
to the right. Many people have mistakenly concluded that rounding
overflow is impossible during multiplication, since they look
at the maximum value of $|f↓uf↓v|$, which is $1 - 2b↑{-p} + b↑{-2p}$;
and this cannot round up to 1. The fallacy in this reasoning
is exhibited in exercise 11. Curiously, the phenomenon of rounding
overflow {\sl is} impossible during floating-point division (see
exercise 12).
({\sl Note:} There is one school of thought that
says it is harmless to ``round'' .999999997 to .99999999 instead
of to 1.0000000, since this does not increase the worst-case
bounds on relative error. Furthermore the floating-point number
1.0000000 may be said to represent all real values in the interval
$[1.0000000 - 5 \times 10↑{-8}, 1.0000000 + 5 \times 10↑{-8}]$,
while .99999999 represents all values in the much smaller interval
$(.99999999 - 5 \times 10↑{-9}, .99999999 + 5 \times 10↑{-9})$.
Even though the latter interval does not contain the original
value .999999997, each number of the second interval is contained
in the first, so subsequent calculations with the second interval
are no less accurate than with the first. But this argument
is incompatible with the mathematical philosophy of floating-point
arithmetic expressed in Section 4.2.2.)
\yskip\textindent{5)}{\sl Rounding before normalizing.}\xskip Inaccuracies
are caused by premature rounding in the wrong digit position.
This error is obvious when rounding is being done to the left
of the appropriate position; but it is also dangerous in the
less obvious cases where rounding is first done too far to the
right and followed by rounding in the true position. For
this reason it is a mistake to round during the ``scaling-right''
operation in step A5, except as prescribed in exercise 5.\xskip
(The special case of rounding in step N5, then rounding again
after rounding overflow has occurred, is harmless, however, because rounding
overflow always yields $\pm 1.0000000$ and this is unaffected by the subsequent
rounding process.)
\yskip\textindent{6)}{\sl Failure to retain enough precision in intermediate
calculations.}\xskip Detailed analyses of the accuracy of floating-point
arithmetic, made in the next section, suggest strongly that
normalizing floating-point routines should always deliver a
properly rounded result to the maximum possible accuracy. There
should be no exceptions to this dictum, even in cases that
occur with extremely low probability; the appropriate number
of significant digits should be retained throughout the computations,
as stated in Algorithms A and M.
\subsectionbegin{C. Floating-point hardware} Nearly
every large computer intended for scientific calculations includes
floating-point arithmetic as part of its repertoire of built-in
operations. Unfortunately, the design of such hardware usually
includes some anomalies that result in dismally poor behavior
in certain circumstances, and we hope that future computer designers
will pay more attention to providing the proper behavior than
they have in the past. It costs only a little more to build the
machine right, and considerations in the following section show
that substantial benefits will be gained.
The \MIX\ computer, which is being used as an example
of a ``typical'' machine in this series of books, has an optional
``floating-point attachment'' (available at extra cost) that
includes the following seven operations:
\yyskip\noindent $\bullet$ \.{FADD}, \.{FSUB}, \.{FMUL}, \.{FDIV},
\.{FLOT}, \.{FCMP} (C = 1, 2, 3, 4, 5, 56, respectively; \hbox{F = 6}).
\xskip The contents of rA after the operation ``\.{FADD V}'' are
precisely the same as the contents of rA after the operations
$$\vbox{\mixtwo{STA⊗ACC\cr LDA⊗V\cr JMP⊗FADD\cr}}$$
where \.{FADD} is the subroutine that appears earlier
in this section, except that both operands are automatically
normalized before entry to the subroutine if they are not already
in normalized form.\xskip (If exponent underflow occurs during this
pre-normalization, but not during the normalization of the answer,
no underflow is signalled.)\xskip Similar remarks apply to \.{FSUB}, \.{FMUL},
and \.{FDIV}. The contents of rA after the operation ``\.{FLOT}'' are
the contents after ``\.{JMP FLOT}'' in the subroutine (10) above.
The contents of rA are unchanged by the operation
``\.{FCMP V}''; this instruction sets the comparison indicator to
less, equal, or greater, depending on whether the contents of
rA are ``definitely less than,'' ``approximately equal to,''
or ``definitely greater than'' \.V; this subject is discussed
in the next section, and the precise action is defined by the
subroutine \.{FCMP} of exercise 4.2.2--17 with \.{EPSILON} in location 0.
No register other than rA is affected by any of
the floating-point operations. If exponent overflow or underflow
occurs, the overflow toggle is turned on and the exponent of
the answer is given modulo the byte size. Division by zero leaves
undefined garbage in rA. Execution times: $4u$,
$4u$, $9u$, $11u$, $3u$, $4u$, respectively.
\yskip\noindent $\bullet$ \.{FIX} (C = 5; F = 7).\xskip The contents of
rA are replaced by the integer ``round(rA)'', rounding to the nearest integer
as in step N5 of Algorithm N\null.
However, if this answer is too large to fit in the register,
the overflow toggle is set on and the result is undefined. Execution time: $3u$.
\yskip Sometimes it is helpful to use floating-point operators
in a nonstandard way. For example, if the operation \.{FLOT}
had not been included as part of \MIX's floating-point attachment,
we could easily achieve its effect on 4-byte numbers by writing
$$\vcenter{\mixthree{FLOT⊗STJ⊗9F\cr
⊗SLA⊗1\cr
⊗ENTX⊗Q+4\cr
⊗SRC⊗1\cr
⊗FADD⊗=0=\cr
9H⊗JMP⊗*⊗\quad\blackslug\cr}}\eqno(12)$$
This routine is not strictly equivalent to
the \.{FLOT} operator, since it assumes that the 1:1 byte of rA
is zero, and it destroys rX. The handling of more general situations
is a little tricky, because rounding overflow can occur even
during a \.{FLOT} operation.
%folio 276 galley 5 (C) Addison-Wesley 1978 *
Similarly, suppose \MIX\ had a \.{FADD} operation but not \.{FIX}.
If we wanted to round a number $u$ from floating-point
form to the nearest fixed-point integer, and if we knew that
the number was nonnegative and would fit in at most three bytes, we
could write
$$\.{FADD\quad FUDGE}$$
where \.{FUDGE} contains the constant
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\vcenter{\hbox to 154pt{\hfill\vbox{\baselineskip0pt\lineskip0pt
\hrule
\hbox{\\ \hbox to 25pt{\hfill\.+\hfill\\}\!
\hbox to 25pt{\hfill\.{Q+4}\hfill\\}\!
\hbox to 25pt{\hfill\.1\hfill\\}\!
\hbox to 25pt{\hfill\.0\hfill\\}\!
\hbox to 25pt{\hfill\.0\hfill\\}\!
\hbox to 25pt{\hfill\.0\hfill\\}}
\hrule}\hfill}};$$
the result in rA would be
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\def\¬{\vrule height 3pt}
\vcenter{\baselineskip0pt\lineskip0pt
\hbox to 154pt{\hfill\vbox{
\hrule
\hbox{\vbox{\hbox{\\}\vfill}\vbox{\hbox to 25pt{\hfill\.+\hfill\\}\vfill}\!
\vbox{\hbox to 25pt{\hfill\.{Q+4}\hfill\\}\vfill}\!
\vbox{\hbox to 25pt{\hfill\.1\hfill\\}\vfill}\!
\vbox{
\hbox to 75pt{\hfill\¬\hfill\¬\hfill\¬}
\hbox to 75pt{\hfill round($u$)\hfill \vrule height 9.4pt depth 4.4pt}
\hbox to 75pt{\hfill\¬\hfill\¬\hfill\¬}}}
\hrule}\hfill}}.\eqno(13)$$
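
The corresponding trick can be tried with the Python sketch developed earlier in this section (an illustration only; the particular numbers are ours): with b = 100, p = 4, q = 50, the constant (q+4, 1/b) has the value b^3, and adding it to a nonnegative u whose rounded value fits in three base-b digits leaves round(u) in the low-order digits of the resulting fraction.

from fractions import Fraction

b, p, q = 100, 4, 50
fudge = (q + 4, Fraction(1, b))              # value b**3 = 1000000
u = (52, Fraction(1234567, 10**8))           # the value 123.4567
e, f = fadd(u, fudge, b, p, 0, 99)           # fadd from the sketch after Algorithm N
print(int(f * b**p) - b**3)                  # prints 123, i.e., round(123.4567)
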
Some computers have floating-point
hardware that suppresses automatic\penalty-200\ rounding and gives additional
information about the less-significant digits that would ordinarily
be rounded off. Such facilities are of use in extending the
precision of floating-point calculations, but their implementation
is usually subject to unfortunate anomalies.
\subsectionbegin{D. History and bibliography} The origins
of floating-point notation can be traced back to Babylonian mathematicians
(1800 {\:m B.C.} or earlier), who made extensive
use of radix-60 floating-point arithmetic but did not have a
notation for the exponents. The appropriate exponent was always
somehow ``understood'' by the man doing the calculations. At
least one case has been found in which the wrong answer was
given because addition was performed with improper alignment
of the operands, but such examples are very rare; see O. Neugebauer,
{\sl The Exact Sciences in Antiquity} (Princeton, N. J.: Princeton University
Press, 1952), 26--27. Another early contribution to floating-point
notation is due to the Greek mathematician Apollonius (3rd century
{\:m B.C.}), who apparently was
the first to explain how to simplify multiplication by collecting
powers of 10 separately from their coefficients, at least in
simple cases.\xskip [For a discussion of Apollonius's method, see
Pappus, {\sl Mathematical Collections} (4th century {\:m A.D.}).]\xskip
After the Babylonian civilization
died out, the first significant uses of floating-point notation
for products and quotients did not emerge until much later,
about the time logarithms were invented (1600) and shortly afterwards
when Oughtred invented the slide rule (1630). The modern notation
``$\,x↑n\,$'' for exponents was being introduced at about the same
time; separate symbols for $x$ squared, $x$ cubed, etc.\ had
been in use before this.
Floating-point arithmetic was incorporated into
the design of some of the earliest computers. It was independently
proposed by Leonardo Torres y Quevedo in Madrid, 1914; by Konrad
Zuse in Berlin, 1936; and by George Stibitz in New Jersey, 1939.
Zuse's machines used a floating-binary representation that
he called ``semi-logarithmic notation''; he also incorporated
conventions for dealing with special quantities like ``$∞$'' and
``undefined.'' The first American computers to operate with
floating-point arithmetic hardware were the Bell Laboratories'
Model V and the Harvard Mark II, both of which were relay calculators
designed in 1944.\xskip [See B. Randell, {\sl The Origins of Digital
Computers} (Berlin: Springer, 1973), pp.\ 100, 155, 163--164,
259--260; {\sl Proc.\ Symp.\ on Large-Scale Digital Calculating
Machinery} (Harvard, 1947), 41--68, 69--79; {\sl Datamation
\bf 13} (April 1967), 35--44, (May 1967), 45--49; {\sl Zeitschrift
f\"ur angew.\ Math.\ und Physik \bf 1} (1950),
345--346.]
The use of floating-binary arithmetic
was seriously considered in 1944--1946 by researchers at the
Moore School in their plans for the first {\sl electronic} digital
computers, but it turned out to be much harder to implement
floating-point circuitry with tubes than with relays. The group
realized that scaling is a problem in programming, but felt
that it is only a very small part of a total programming job
which is usually worth the time and trouble it takes, since
it tends to keep a programmer aware of the numerical accuracy
he is getting. Furthermore, they argued that floating-point
representation takes up valuable memory space, since the exponents
must be stored, and that it becomes difficult to adapt floating-point
arithmetic to multiple-precision calculations.\xskip [See von Neumann's
{\sl Collected Works} {\bf 5} (New York: Macmillan, 1963), 43,
73--74.]\xskip At this time, of course, they were designing the first
stored-program computer and the second electronic computer,
and their choice had to be either fixed-point {\sl or} floating-point
arithmetic, not both. They anticipated the coding of floating-binary
routines, and in fact ``shift-left'' and ``shift-right'' instructions
were put into their machine primarily to make such routines
more efficient. The first machine to have both kinds of arithmetic
in its hardware was apparently a computer developed at General
Electric Company [see
{\sl Proc.\ 2nd Symp.\ on Large-Scale
Digital Calculating Machinery} (Cambridge: Harvard University
Press, 1951), 65--69].
Floating-point subroutines and interpretive systems
for early machines were coded by D. J. Wheeler and others, and
the first publication of such routines was in {\sl The Preparation
of Programs for an Electronic Digital Computer} by Wilkes, Wheeler,
and Gill (Reading, Mass.: Addison-Wesley, 1951), subroutines
A1--A11, pp.\ 35--37, 105--117. It is interesting to note that
floating-{\sl decimal} subroutines are described here, although
a binary computer was being used; in other words, the numbers
were represented as $10↑ef$, not $2↑ef$, and therefore the scaling
operations required multiplication or division by 10. On this
particular machine such decimal scaling was about as easy as
shifting, and the decimal approach greatly simplified input-output
conversions.
Most published references to the details of floating-point
arithmetic routines are scattered in ``technical memorandums''
distributed by various computer manufacturers, but there have
been occasional appearances of these routines in the open literature.
Besides the reference above, see R. H. Stark and D. B. MacMillan,
{\sl Math.\ Comp.\ \bf 5} (1951), 86--92, where a plugboard-wired
program is described; D. McCracken, {\sl Digital Computer Programming}
(New York: Wiley, 1957), 121--131; J. W. Carr III, {\sl CACM
\bf 2} (May 1959), 10--15; W. G. Wadey, {\sl JACM \bf 7} (1960),
129--139; D. E. Knuth, {\sl JACM \bf 8} (1961), 119--128; O.
Kesner, {\sl CACM \bf 5} (1962), 269--271; F. P. Brooks and
K. E. Iverson, {\sl Automatic Data Processing} (New York: Wiley,
1963), 184--199. For a discussion of floating-point arithmetic
from a computer designer's standpoint, see ``Floating-point
operation'' by S. G. Campbell, in {\sl Planning a Computer System,}
ed.\ by W. Buchholz (New York: McGraw-Hill, 1962), 92--121. Additional
references are given in Section 4.2.2.
\exbegin{EXERCISES}
\exno 1. [10] How would
Avogadro's number and Planck's constant be represented in base
100, excess 50, four-digit floating-point notation? $\biglp$This would
be the representation used by \MIX, as in (4), if the byte size is 100.$\bigrp$
%folio 282 galley 6 (C) Addison-Wesley 1978 *
\exno 2. [12] Assume that the exponent
$e$ is constrained to lie in the range $0 ≤ e ≤ E$; what are
the largest and smallest positive values that can be written
as base $b$, excess $q$, $p$-digit floating-point numbers? What
are the largest and smallest positive values that can be written
as {\sl normalized} floating-point numbers with these specifications?
\exno 3. [11] (K. Zuse, 1936.)\xskip Show that
if we are using normalized floating-binary
arithmetic, there is a way to increase the precision slightly
without loss of memory space: A $p$-bit fraction part can be
represented using only $p - 1$ bit positions of a computer word,
if the range of exponent values is decreased very slightly.
\trexno 4. [12] Assume that $b = 10$, $p = 8$. What result does
Algorithm A give for $(50, +.98765432) \oplus (49, +.33333333)$?
For $(53, -.99987654) \oplus (54, +.10000000)$? For $(45, -.50000001)
\oplus (54, +.10000000)$?
\trexno 5. [24] Let us say that $x ~ y$ (with respect to a given
radix $b$) if $x$ and $y$ are real numbers satisfying the following
conditions:
$$\eqalign{\lfloor x/b\rfloor ⊗= \lfloor y/b\rfloor;\cr
\noalign{\vskip 2pt}
x\mod b = 0\qquad⊗\hbox{iff}\qquad y \mod b = 0;\cr
\noalign{\vskip 2pt}
\textstyle 0 < x\mod b < {1\over 2}b\qquad⊗\hbox{iff}\qquad
\textstyle 0 < y \mod b < {1\over 2}b;\cr
\noalign{\vskip 2pt}
\textstyle x \mod b = {1\over 2}b\qquad⊗\hbox{iff}\qquad
\textstyle y \mod b = {1\over 2}b;\cr
\noalign{\vskip 2pt}
\textstyle {1\over 2}b < x \mod b < b\qquad⊗\hbox{iff}\qquad
\textstyle {1\over 2}b < y \mod b < b.\cr}$$
Prove that if $f↓v$ is replaced by $b↑{-p-2}F↓v$
between steps A5 and A6 of Algorithm A\null, where $F↓v ~ b↑{p+2}f↓v$,
the result of that algorithm will be unchanged.\xskip (If $F↓v$ is
an integer and $b$ is even, this operation essentially truncates
$f↓v$ to $p + 2$ places while remembering whether any nonzero
digits have been dropped, thereby minimizing the length of register
that is needed for the addition in step A6.)
\exno 6. [20] If the result of a \.{FADD} instruction is zero, what
will be the sign of rA, according to the definitions of \MIX's
floating-point attachment given in this section?
\exno 7. [27] Discuss floating-point arithmetic
using balanced ternary notation.
\exno 8. [20] Give examples of normalized eight-digit floating-decimal
numbers $u$ and $v$ for which addition yields\xskip (a) exponent underflow,\xskip
(b) exponent overflow, assuming that exponents must satisfy
$0 ≤ e < 100$.
\exno 9. [M24] (W. M. Kahan.)\xskip Assume
that the occurrence of exponent underflow causes the result
to be replaced by zero, with no error indication given. Using
excess zero, eight-digit floating-decimal numbers with $e$ in
the range $-50 ≤ e < 50$, find positive values of $a$, $b$,
$c$, $d$, and $y$ such that (11) holds.
\exno 10. [12] Give an example of normalized eight-digit floating-decimal
numbers $u$ and $v$ for which rounding overflow occurs in addition.
\trexno 11. [M20] Give an example of normalized, excess 50, eight-digit
floating-decimal numbers\-\ $u$ and $v$ for which rounding overflow
occurs in multiplication.
\exno 12. [M25] Prove that rounding overflow cannot occur during
the normalization phase of floating-point division.
\def\upper#1{\mathop{\hbox to 8.889pt{\:u\char'156\hskip-8.889pt\hfill
$\vcenter{\hbox{$\scriptscriptstyle#1$}}$\hfill}}}
\def\lwr#1{\mathop{\hbox to 8.889pt{\:u\char'157\hskip-8.889pt\hfill
$\vcenter{\hbox{$\scriptscriptstyle#1$}}$\hfill}}}
\exno 13. [30] When doing ``interval arithmetic'' we don't want
to round the results of a floating-point computation; we want
rather to implement operations such as $\lwr+$ and $\upper+$, which
give the tightest possible representable bounds on the true
sum: $$u \lwr+ v \;≤\; u + v \;≤\; u \upper+ v.$$ How should the algorithms
of this section be modified for such a purpose?
\exno 14. [25] Write a \MIX\ subroutine that begins with an arbitrary
floating-point number in register A, not necessarily normalized,
and that converts it to the nearest fixed-point integer (or
determines that it is too large in absolute value to make such
a conversion possible).
\def\omod{\hbox to 25pt{\hfill$\vcenter{\hbox{\:@\char'140}}$\hfill}\hskip-25pt
\raise .3pt\hbox to 25pt{\hfill$\vcenter{\moveleft .2pt\hbox{\:emod}}$\hfill}}
\trexno 15. [28] Write a \MIX\ subroutine, to be used in connection
with the other subroutines of this section, that calculates
$u\omod 1$, that is, $u - \lfloor u\rfloor$ rounded to the nearest
floating-point number, given a floating-point number $u$. Note
that when $u$ is a very small negative number, $u\omod 1$ will
be rounded so that the result is unity (even though $u \mod
1$ has been defined to be always {\sl less} than unity,
as a real number).
\exno 16. [HM21] (Robert L. Smith.)\xskip Design an algorithm to compute
the real and imaginary parts of the complex number $(a + bi)/(c
+ di)$, given real floating-point values $a$, $b$, $c$, and $d$.
Avoid the computation of $c↑2 + d↑2$, since it would cause floating-point
overflow even when $|c|$ or $|d|$ is approximately the
square root of the maximum allowable floating-point value.
\exno 17. [40] (John Cocke.)\xskip Explore the idea of extending the
range of floating-point numbers by defining a single-word representation
in which the precision of the fraction decreases as the magnitude
of the exponent increases.
\exno 18. [25] Consider a binary computer with 36-bit words,
on which positive floating-binary numbers are represented as
$(0e↓1e↓2 \ldotsm e↓8f↓1f↓2 \ldotsm f↓{27})↓2$; here $(e↓1e↓2
\ldotsm e↓8)↓2$ is an excess $(10000000)↓2$ exponent and $(f↓1f↓2
\ldotsm f↓{27})↓2$ is a 27-bit fraction. Negative floating-point
numbers are represented by the {\sl two's complement} of the corresponding
positive representation (see Section 4.1). Thus, 1.5 is {\it 201$\,|$600000000}
in octal notation, while $-1.5$ is {\it 576$\,|$200000000\/}; the
octal representations of 1.0 and $-1.0$ are {\it 201$\,|$400000000} and
{\it 576$\,|$400000000}, respectively.\xskip (A vertical line is used here to show
the boundary between exponent and fraction.)\xskip Note that bit $f↓1$ of a
normalized positive number is always 1, while it is almost always
zero for negative numbers; the exceptional cases are representations
of $-2↑k$.
Suppose that the exact result of a floating-point
operation has the octal code {\it 572$\,|$740000000$\,|$01\/};
this (negative) 33-bit fraction must be normalized and rounded
to 27 bits. If we shift left until the leading fraction bit
is zero, we get {\it 576$\,|$000000000$\,|$20}, but this
rounds to the illegal value {\it 576$\,|$000000000\/}; we
have over-normalized, since the correct answer is {\it 575$\,|$400000000}.
On the other hand if we start (in some other problem) with the value
{\it 572$\,|$740000000$\,|$05} and stop before over-normalizing
it, we get {\it 575$\,|$400000000$\,|$50}, which rounds
to the unnormalized number {\it 575$\,|$400000001\/}; subsequent
normalization yields {\it 576$\,|$000000002} while the correct
answer is {\it 576$\,|$000000001}.
Give a simple, correct rounding rule that resolves
this dilemma on such a machine (without abandoning two's complement
notation).
\exno 19. [24] What is the running time for the \.{FADD} subroutine
in Program A\null, in terms of relevant characteristics of the data?
What is the maximum running time, over all inputs that do not
cause overflow or underflow?
%folio 285 galley 7 WARNING: Much tape unreadable! (C) Addison-Wesley 1978 *
\runningrighthead{ACCURACY OF FLOATING-POINT ARITHMETIC}
\section{4.2.2}
\sectionskip
\sectionbegin{4.2.2. Accuracy of Floating-Point Arithmetic}
Floating-point computation is by nature
inexact, and it is not difficult to misuse it so that the computed
answers consist almost entirely of ``noise.'' One of the principal
problems of numerical analysis is to determine how accurate
the results of certain numerical methods will be. A ``credibility-gap''
problem is involved here: we don't know how much of the computer's
answers to believe. Novice computer users solve this problem
by implicitly trusting in the computer as an infallible authority;
they tend to believe that all digits of a printed answer are
significant. Disillusioned computer users have just the opposite
approach; they are constantly afraid that their answers are almost
meaningless. Many a serious mathematician has attempted to give
rigorous analyses of a sequence of floating-point operations,
but has found the task to be so formidable that he has tried to content
himself with plausibility arguments instead.
A thorough examination of error analysis techniques
is, of course, beyond the scope of this book, but we will in
this section study some of the characteristics of floating-point
arithmetic errors. Our goal is to discover how to perform floating-point
arithmetic in such a way that reasonable analyses of error propagation
are facilitated as much as possible.
A rough (but reasonably useful) way to express
the behavior of floating-point arithmetic can be based on the
concept of ``significant figures'' or {\sl relative error.}
If we are representing an exact real number $x$ inside a computer
by using the approximation $\A x = x(1 + ε)$, the quantity
$ε = (\A x - x)/x$ is called the relative error of approximation.
Roughly speaking, the operations of floating-point multiplication
and division do not magnify the relative error by very much;
but floating-point subtraction of nearly equal quantities (and
floating-point addition, $u \oplus v$, where $u$ is nearly equal
to $-v$) can very greatly increase the relative error. So we
have a general rule of thumb, that a substantial loss of accuracy
is expected from such additions and subtractions, but not from
multiplications and divisions. On the other hand, the situation
is somewhat paradoxical and needs to be understood properly,
since ``bad'' additions and subtractions are performed with
perfect accuracy! (See exercise 25.)
\topinsert{\quoteformat{Round numbers are always false.\cr}
\author{SAMUEL JOHNSON (1750)}
\quoteformat{I shall speak in round numbers, not absolutely accurate,\cr
yet not so wide from truth as to vary the result materially.\cr}
\author{THOMAS JEFFERSON (1824)}}
One of the consequences of the possible unreliability
of floating-point addition is that the associative law breaks
down:
$$(u \oplus v) \oplus w ≠ u \oplus (v \oplus w),\qquad\hbox{for many }
u, v, w.\eqno (1)$$
For example,
$$\eqalign{(11111113. \oplus -11111111.) \oplus 7.5111111 ⊗= 2.0000000
\oplus 7.5111111\cr
⊗= 9.5111111;\cr
\noalign{\vskip 3pt}
11111113. \oplus (-11111111. \oplus 7.5111111) ⊗= 11111113.
\oplus -11111103.\cr
⊗= 10.000000.\cr}$$
(All examples in this section are given in eight-digit floating-decimal
arithmetic, with exponents indicated by an explicit decimal point. Recall
that, as in Section 4.2.1, the symbols $\oplus$, $\ominus$, $\otimes$,
$\odiv$ are used to stand for floating-point operations corresponding to
the exact operations $+$, $-$, $\times$, $/$.)
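
The failure of associativity is easy to reproduce on a present-day system: Python's decimal module, set to a precision of eight significant digits, carries out the same kind of rounded decimal arithmetic as these examples (for base 10 the tie-breaking rule of Algorithm 4.2.1N coincides with round-half-to-even, and in any event no ties arise here). The following is an illustrative aside, not part of the text.

from decimal import Decimal, getcontext, ROUND_HALF_EVEN

getcontext().prec = 8                     # eight significant digits
getcontext().rounding = ROUND_HALF_EVEN   # ties to even, matching b = 10 in Algorithm 4.2.1N

u, v, w = Decimal("11111113"), Decimal("-11111111"), Decimal("7.5111111")
print((u + v) + w)     # 9.5111111
print(u + (v + w))     # 10  (i.e., 10.000000)
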
In view of the failure of the associative law,
the comment of Mrs. La Touche that appears at the beginning
of this chapter [taken from {\sl Math.\ Gazette \bf 12} (1924),
95] makes a good deal of sense with respect to floating-point
arithmetic. Mathematical notations like ``$a↓1 + a↓2 + a↓3$''
or ``$\sum ↓{1≤k≤n} a↓k$'' are inherently based upon the assumption
of associativity, so a programmer must be especially careful
that he does not implicitly assume the validity of the associative
law.
\subsectionbegin{A. An axiomatic approach} Although
the associative law is not valid, the commutative law
$$u \oplus v = v \oplus u\eqno (2)$$
does hold, and this law can be a valuable conceptual
asset in programming and in the analysis of programs. This example
suggests that we should look for important laws that {\sl are}
satisfied by $\oplus$, $\ominus$, $\otimes$, and $\odiv$; it is
not unreasonable to say that {\sl floating-point routines should
be designed to preserve as many of the ordinary mathematical
laws as possible.}
Let us therefore consider some of the other basic laws
that are valid for normalized floating-point operations as
described in the previous section. First we have
$$\eqalignno{u\ominus v⊗=u\oplus -v;⊗(3)\cr
\noalign{\vskip 4pt}
-(u \oplus v)⊗= -u \oplus -v;⊗(4)\cr
\noalign{\vskip 4pt}
u\oplus v=0\qquad\hbox{if a}⊗\hbox{nd only if}\qquad v=-u;⊗(5)\cr
\noalign{\vskip 4pt}
u\oplus 0⊗=u.⊗(6)\cr}$$
From these laws we can derive further identities; for example (exercise 1),
$$u\ominus v=-(v\ominus u).\eqno(7)$$
Identities (2) to (6) are easily deduced from the algorithms in Section 4.2.1.
The following rule is slightly less obvious:
$$\hbox{if}\qquad u≤v\qquad\hbox{then}\qquad u\oplus w≤v\oplus w.\eqno(8)$$
Instead of attempting to prove this rule by analyzing Algorithm 4.2.1A\null, let us
go back to the principle underlying the design of that algorithm.\xskip
(Algorithmic proofs aren't always easier than mathematical ones.)\xskip Our idea was
that the floating-point operations should satisfy
$$\eqalign{u\oplus v⊗=\hbox{round}(u+v),\cr
u\otimes v⊗=\hbox{round}(u\times v),\cr}\qquad
\eqalign{u\ominus v⊗=\hbox{round}(u-v),\cr
u\odiv v⊗=\hbox{round}(u\>\,/\,\>v),\cr}\eqno(9)$$
where round$(x)$ denotes the best floating-point
approximation to $x$ as defined in Algorithm 4.2.1N\null. We have
$$\eqalignno{\hbox{round}(-x)⊗= -\hbox{round}(x),⊗(10)\cr
\noalign{\vskip4pt}
x≤y\qquad\hbox{impli}⊗\hbox{es}\qquad \hbox{round}(x) ≤ \hbox{round}(y),⊗(11)\cr
}$$
and these fundamental relations lead to immediate
proofs of properties (2) through (8). We can also write down
several more identities:
$$\vbox{\halign{\ctr{$#$}\cr
u \otimes v = v \otimes u,\qquad (-u) \otimes v = -(u \otimes
v),\qquad 1 \otimes v = v;\cr
\noalign{\vskip 6pt}
u\otimes v=0\qquad\hbox{if and only if}\qquad
\hbox{$u=0$ or $v=0$;}\cr
\noalign{\vskip 6pt}
(-u) \odiv v = u \odiv (-v) = -(u \odiv v);\cr
\noalign{\vskip 6pt}
0 \odiv v = 0,\qquad u \odiv 1 = u,\qquad u \odiv u = 1.\cr}}$$
If $u ≤ v$ and $w > 0$, then $u \otimes w ≤ v \otimes
w$ and $u \odiv w ≤ v \odiv w$ and $w \odiv u ≥ w \odiv
v$. If $u \oplus v = u + v$, then $(u \oplus v) \ominus v = u$;
and if $u \otimes v = u \times v≠ 0$ then $(u \otimes v) \odiv
v = u$. We see that a good deal of regularity is present
in spite of the inexactness of the floating-point operations,
when things have been defined properly.
%folio 291 galley 8 (C) Addison-Wesley 1978 *
Several familiar rules of algebra are still, of course,
conspicuously absent from the collection of identities above;
the associative law for floating-point multiplication is not
strictly true, as shown in exercise 3, and the distributive
law between $\otimes$ and $\oplus$ can fail rather badly: Let $u
= 20000.000$, $v = -6.0000000$, and $w = 6.0000003$; then
$$\eqalign{(u \otimes v) \oplus (u \otimes w) ⊗= -120000.00
\oplus 120000.01 = .010000000\cr\noalign{\vskip4pt}
u \otimes (v \oplus w) ⊗= 20000.000 \otimes .00000030000000
= .0060000000\cr}$$
so
$$u \otimes (v \oplus w) ≠ (u \otimes v) \oplus (u \otimes w).\eqno(12)$$
On the other hand we do have $b \otimes (v \oplus
w) = (b \otimes v) \oplus (b \otimes w)$, when $b$ is the floating-point radix,
since
$$\hbox{round}(bx) = b\,\hbox{round}(x).\eqno (13)$$
(Strictly speaking, the identities and inequalities
we are considering in this section implicitly assume that
exponent underflow and overflow do not occur. The function round$(x)$
is undefined when $|x|$ is too small or too large, and equations
such as (13) hold only when both sides are defined.)
The failure of Cauchy's fundamental inequality
$$(x↓1↑2 + \cdots + x↓n↑2)(y↓1↑2 + \cdots
+ y↓n↑2) ≥ (x↓1y↓1 + \cdots + x↓ny↓n)↑2$$ is another important
example of the breakdown of traditional algebra in the presence
of floating-point arithmetic. Exercise 7 shows that Cauchy's inequality can fail
even in the simple case $n=2$, $x↓1=x↓2=1$.
Novice programmers
who calculate the standard deviation of some observations by
using the textbook formula
$$\sigma = \sqrt{\chop to 9pt{\bigglp n \sum↓{1≤k≤n}x↓k↑2-\bigglp\sum
↓{1≤k≤n}x↓k\biggrp↑2\;\biggrp\vcenter{\hbox{\:@\char'36}}n(n - 1)}}\eqno (14)$$
often find themselves taking the square
root of a negative number! A much better way to calculate means
and standard deviations with floating-point arithmetic is to
use the recurrence formulas
$$\vbox{\tabskip 0pt plus 1000pt
\halign to size{\hfill$\dispstyle{#}$\tabskip 0pt
⊗$\dispstyle{\null#}$\hfill\qquad⊗\hfill$\dispstyle{#}$⊗$\dispstyle{\null#}$\hfill
\tabskip 0pt plus 1000pt⊗\hfill$#$\tabskip0pt\cr
M↓1 ⊗= x↓1,⊗M↓k ⊗= M↓{k-1} \oplus (x↓k \ominus
M↓{k-1}) \odiv k,⊗(15)\cr
\noalign{\vskip 4pt}
S↓1 ⊗= 0,⊗ S↓k ⊗= S↓{k-1} \oplus (x↓k
\ominus M↓{k-1}) \otimes (x↓k \ominus M↓k),⊗(16)\cr}}$$
for $2 ≤ k ≤ n$, where $\sigma = \sqrt{S↓n/(n - 1)}$.\xskip
[Cf.\ B. P. Welford, {\sl Technometrics \bf 4} (1962), 419--420.]\xskip
With this method $S↓n$ can never be negative, and we avoid other
serious problems encountered by the na\"\i ve method of accumulating
sums, as shown in exercise 16.\xskip (See exercise 19 for a summation
technique that provides an even better guarantee on the
accuracy.)
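
In ordinary binary floating-point arithmetic, recurrences (15) and (16) take the following form (a Python sketch; the function name is ours, and at least two observations are assumed):

import math

def running_mean_and_stddev(xs):
    """Welford-style recurrences (15) and (16); returns (mean, standard deviation)."""
    m = 0.0                            # M_k, the running mean
    s = 0.0                            # S_k, the running sum of squared deviations
    for k, x in enumerate(xs, start=1):
        delta = x - m                  # x_k - M_{k-1}
        m += delta / k                 # M_k = M_{k-1} + (x_k - M_{k-1}) / k
        s += delta * (x - m)           # S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k)
    return m, math.sqrt(s / (len(xs) - 1))   # sigma = sqrt(S_n / (n - 1))
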
Although algebraic laws
do not always hold exactly, we can often show that they aren't
too far off base. When $b↑{e-1} ≤ x < b↑e$ we have round$(x)
= x + \rho (x)$ where $|\rho (x)| ≤ {1\over 2}b↑{e-p}$; hence
$$\hbox{round}(x) = x\biglp 1 + \delta (x)\bigrp ,\eqno (17)$$
where the relative error is bounded independently
of $x$:
$$\textstyle|\delta (x)| ≤ {1\over 2}/(b↑{p-1}+{1\over2})<{1\over2}b↑{1-p}.
\eqno (18)$$
We can use this inequality to estimate the relative
error of normalized floating-point calculations in a simple
way, since $u \oplus v = (u + v)\biglp 1 + \delta (u + v)\bigrp
$, etc.
As an example of typical error-estimation procedures,
let us consider the associative law for multiplication. Exercise
3 shows that $(u \otimes v) \otimes w$ is not in general equal
to $u \otimes (v \otimes w)$; but the situation in this case
is much better than it was with respect to the associative law
of addition (1) and the distributive law (12). In fact, we have
$$\eqalign{(u \otimes v) \otimes w ⊗ = \biglp (uv)(1 + \delta ↓1)\bigrp
\otimes w = uvw(1 + \delta ↓1)(1 + \delta ↓2),\cr\noalign{\vskip 4pt}
u \otimes (v \otimes w) ⊗= u \otimes \biglp (vw)(1 + \delta
↓3)\bigrp = uvw(1 + \delta ↓3)(1 + \delta ↓4),\cr}$$
for some $\delta ↓1$, $\delta ↓2$, $\delta ↓3$, $\delta ↓4$,
provided that no exponent underflow or overflow occurs, where
$|\delta ↓j| < {1\over 2}b↑{1-p}$ for each $j$. Hence
$${(u \otimes v) \otimes w\over u \otimes (v \otimes w)} =
{(1 + \delta ↓1)(1 + \delta ↓2)\over (1 + \delta ↓3)(1 + \delta
↓4)} = 1 + \delta ,$$
where
$$\textstyle|\delta | < 2b↑{1-p}/(1 - {1\over 2}b↑{1-p})↑2.\eqno (19)$$
The number $b↑{1-p}$ occurs so often in such
analyses that it has been given a special name, one {\sl ulp}, meaning
one ``unit in the last place'' of the fraction part. Floating-point
operations are correct to within half an ulp, and the calculation
of $uvw$ by two floating-point multiplications will be correct
within about one ulp (ignoring second-order terms). Hence
the associative law for multiplication holds to within about
two ulps of relative error.
We have shown that $(u \otimes v) \otimes w$ is
approximately equal to $u \otimes (v \otimes w)$, except
when exponent overflow or underflow is a problem. It is worth
while to study this intuitive idea of being ``approximately
equal'' in more detail; can we make such a statement more precise
in a reasonable way?
A programmer using floating-point arithmetic almost
never wants to test if two computed values are exactly equal to each other
(or at least he hardly ever should
try to do so), because this is an extremely improbable occurrence.
For example, if a recurrence relation
$$x↓{n+1} = f(x↓n)$$
is being used, where the theory in some textbook
says that $x↓n$ approaches a limit as $n → ∞$, it is usually a mistake
to wait until $x↓{n+1} = x↓n$ for some $n$, since the sequence
$x↓n$ might be periodic with a longer period due to the rounding
of intermediate results. The proper procedure is to wait until
$|x↓{n+1} - x↓n| < \delta $, for some suitably chosen number
$\delta $; but since we don't necessarily know the order of
magnitude of $x↓n$ in advance, it is even better to wait until
$$|x↓{n+1} - x↓n| ≤ ε|x↓n|;\eqno (20)$$
now $ε$ is a number that is much easier to select.
This relation (20) is another way of saying that $x↓{n+1}$ and
$x↓n$ are approximately equal; and our discussion indicates
that a relation of ``approximately equal'' would be more useful
than the traditional relation of equality, when floating-point computations
are involved, if we could only define a suitable approximation relation.
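For example, such an iteration might be sketched in C as follows; the function $f$, the starting
value, the tolerance, and the iteration cutoff are illustrative parameters, not anything
prescribed above.

#include <math.h>

/* Iterate x_{n+1} = f(x_n) until test (20) is satisfied or a cutoff
   is reached.  A sketch only; f, x0, eps, and maxiter are placeholders. */
double iterate_to_limit(double (*f)(double), double x0, double eps, int maxiter)
{
    double x = x0;
    for (int n = 0; n < maxiter; n++) {
        double xnext = f(x);
        if (fabs(xnext - x) <= eps * fabs(x))   /* test (20), not xnext == x */
            return xnext;
        x = xnext;
    }
    return x;    /* the caller may wish to treat nonconvergence separately */
}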
%folio 293 galley 9 (C) Addison-Wesley 1978 *
In other words, the fact that strict equality of floating-point
values is of little importance implies that we ought to have
a new operation, {\sl floating-point comparison}, which is intended
to help assess the relative values of two floating-point quantities.
The following definitions seem to be appropriate for base $b$,
excess $q$, floating-point numbers $u = (e↓u, f↓u)$ and $v =
(e↓v, f↓v)$:
$$\vbox{\tabskip 0pt plus 1000pt\baselineskip16pt
\halign to size{$u#v\quad(ε)$\qquad if and only if\qquad\tabskip 0pt
⊗\hfill$\dispstyle{#}$⊗$\dispstyle{\null#}$\hfill
\tabskip 0pt plus 1000pt⊗\hfill$#$\tabskip0pt\cr
\prec⊗v-u ⊗> ε \max(b↑{e↓u-q}, b↑{e↓v-q});⊗(21)\cr
~⊗|v-u|⊗≤ ε \max(b↑{e↓u-q}, b↑{e↓v-q});⊗(22)\cr
\succ⊗u-v⊗>ε\max(b↑{e↓u-q},b↑{e↓v-q});⊗(23)\cr
\approx⊗|v-u|⊗≤ε\min(b↑{e↓u-q},b↑{e↓v-q}).⊗(24)\cr}}$$
These definitions apply to unnormalized values
as well as to normalized ones. Note that exactly one of the conditions
$u\prec v$ (definitely less than),
$u ~ v$ (approximately equal to), or $u \succ v$ (definitely greater
than) must always hold for any given pair of values $u$ and
$v$. The relation $u \approx v$ is somewhat stronger than $u
~ v$, and it might be read ``$u$ is essentially equal to $v$.''
All of the relations are given in terms of a positive real number
$ε$ that measures the degree of approximation being considered.
One way to view the above definitions is to associate
a ``neighborhood'' set $N(u)
= \leftset x \relv |x - u| ≤ εb↑{e↓u-q}\rightset$ with each
floating-point number $u$; thus, $N(u)$ represents a set of values near
$u$ based on the exponent of $u$'s floating-point representation.
In these terms, we have $u \prec v$ if and only if $N(u) < v$ and
$u < N(v)$; $u ~ v$ if and only if $u \in N(v)$ or $v \in N(u)$;
$u \succ v$ if and only if $u > N(v)$ and $N(u) > v$; $u \approx v$
if and only if $u \in N(v)$ and $v \in N(u)$.\xskip (Here we are assuming
that the parameter $ε$, which measures the degree of approximation,
is a constant; a more complete notation would indicate the dependence
of $N(u)$ upon $ε$.)
Here are some simple consequences of the above
definitions:
$$\tabskip 0pt plus 1000pt\baselineskip16pt
\halign to size{\hfill#\hfill⊗$#$\tabskip0pt\cr
if\qquad $u \prec v\quad (ε)$\qquad then\qquad $v \succ u\quad (ε)$;⊗(25)\cr
if\qquad $u\approx v\quad (ε)$\qquad then\qquad $u~v\quad (ε)$;⊗(26)\cr
$u \approx u\quad (ε);$⊗(27)\cr
if\qquad $u \prec v\quad (ε)$\qquad then\qquad $u< v;$⊗(28)\cr
if\qquad $u \prec v\quad (ε↓1)\quad$and\quad $ε↓1
≥ ε↓2\qquad$then\qquad $u \prec v\quad (ε↓2);$⊗(29)\cr
if\qquad $u ~ v\quad (ε↓1)\quad$and\quad $ε↓1
≤ ε↓2\qquad$then\qquad $u ~ v\quad (ε↓2);$⊗(30)\cr
if\qquad $u \approx v\quad (ε↓1)\quad$and\quad
$ε↓1 ≤ ε↓2\qquad$then\qquad $u \approx v\quad (ε↓2);$⊗(31)\cr
if\qquad $u \prec v\quad (ε↓1)\quad$and\quad $v \prec
w\quad (ε↓2)\qquad$then\qquad $u \prec w\quad (ε↓1 + ε↓2);$⊗(32)\cr
if\qquad $u \approx v\quad (ε↓1)\quad$and\quad
$v \approx w\quad (ε↓2)\qquad$then\qquad $u ~ w\quad (ε↓1 +
ε↓2).$⊗(33)\cr}$$
Moreover, we can prove without difficulty that
$$\vbox{\tabskip 0pt plus 1000pt\baselineskip16pt
\halign to size{$|u-v|≤ε|u|$\quad\hfill#
\hfill\quad$|u-v|≤ε|v|$\qquad implies\qquad
\tabskip0pt⊗$u#v\quad(ε);$\tabskip 0pt plus 1000pt⊗$#$\tabskip 0pt\cr
and⊗\approx⊗(34)\cr or⊗~⊗(35)\cr}}$$
and conversely, for {\sl normalized} floating-point numbers $u$ and $v$, when
$ε<1$,
$$\vbox{\tabskip 0pt plus 1000pt\baselineskip16pt
\halign to size{$u#v\quad(ε)$\qquad implies\qquad\tabskip0pt
⊗$|u-v|≤bε|u|$\hfill\quad#\quad
\hfill$|u-v|≤bε|v|$⊗#\hfill\tabskip 0pt plus 1000pt⊗#\tabskip0pt\cr
\approx⊗and⊗;⊗(36)\cr ~⊗or⊗.⊗(37)\cr}}$$
Let $ε↓0 = b↑{1-p}$ be one ulp. The derivation of (17) establishes the
inequality $|x - \hbox{round}(x)| = |\rho (x)| <
{1\over 2}ε↓0 \min\biglp|x|, |\hbox{round}(x)|\bigrp$, hence
$$\textstyle x \approx\hbox{round}(x)\quad ({1\over 2}ε↓0);\eqno (38)$$
it follows that $u \oplus v \approx u + v\quad({1\over 2}ε↓0)$,
etc. The approximate associative law for multiplication derived
above can be recast as follows: We have
$$|(u \otimes v) \otimes w - u \otimes (v \otimes w)| < {2ε↓0\over
(1 - {1\over 2}ε↓0)↑2} |u \otimes (v \otimes w)|$$
by (19), and the same inequality is valid with
$(u \otimes v) \otimes w$ and $u \otimes (v \otimes w)$ interchanged.
Hence by (34),
$$(u \otimes v) \otimes w \approx u \otimes (v \otimes w)\quad
(ε)\eqno (39)$$
whenever $ε ≥ 2ε↓0/(1 - {1\over 2}ε↓0)↑2$. For
example, if $b = 10$ and $p = 8$ we may take $ε = 0.00000021$.
%folio 301 galley 10 (C) Addison-Wesley 1978 *
The relations $\prec$, $~$, $\succ$, and $\approx$ are useful within
numerical algorithms, and it is therefore a good idea to provide
routines for comparing floating-point numbers as well as for
doing arithmetic on them.
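To make this concrete, the following C sketch carries out comparisons (21)--(24) for hardware
binary floating-point numbers, taking $b = 2$ and letting $b↑{e↓u-q}$ be the power of 2 that
bounds $|u|$ from above; the function names, the encoding of the result, and the treatment of
zero operands are illustrative choices, not prescribed by the definitions.

#include <math.h>

/* Scale b^{e_u - q} of a nonzero double u: the power of 2 with
   2^{k-1} <= |u| < 2^k.  An exact zero is given scale 0, so that
   fuzzy comparison with zero degenerates to a sign test.  Sketch only. */
static double scale(double u)
{
    return (u == 0.0) ? 0.0 : ldexp(1.0, ilogb(u) + 1);
}

/* Returns -1, 0, or +1 according as u is definitely less than v (21),
   approximately equal to v (22), or definitely greater than v (23);
   *essential is set when the stronger relation (24) also holds. */
int fcompare(double u, double v, double eps, int *essential)
{
    double smax = fmax(scale(u), scale(v));
    double smin = fmin(scale(u), scale(v));
    double d = v - u;
    *essential = (fabs(d) <= eps * smin);     /* relation (24) */
    if ( d > eps * smax) return -1;           /* relation (21) */
    if (-d > eps * smax) return +1;           /* relation (23) */
    return 0;                                 /* relation (22) */
}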
\def\round{\mathop{\hbox{round}}}
\yskip Let us now shift our attention back to the question
of finding {\sl exact} relations that are satisfied by the
floating-point operations. It is interesting to note that floating-point
addition and subtraction are not completely intractable from
an axiomatic standpoint, since they do satisfy the nontrivial
identities stated in the following theorems:
\thbegin Theorem A. {\sl Let $u$ and $v$ be normalized floating-point
numbers. Then
$$\biglp (u \oplus v) \ominus u\bigrp + \biglp (u \oplus v)
\ominus \biglp (u \oplus v) \ominus u\bigrp\bigrp = u \oplus v,\eqno(40)$$
provided that no exponent overflow or underflow occurs.}
\yyskip\noindent This rather cumbersome-looking identity can be
rewritten in a simpler manner: Let
\def\dprime{↑{\prime\prime}}
$$\eqalign{u↑\prime⊗=(u\oplus v)\ominus v\,,\cr\noalign{\vskip3pt}
u\dprime⊗=(u\oplus v)\ominus v↑\prime,\cr}
\qquad \eqalign{v↑\prime⊗=(u\oplus v)\ominus u\,;\cr\noalign{\vskip3pt}
v\dprime⊗=(u\oplus v)\ominus u↑\prime.\cr}\eqno(41)$$
Intuitively, $u↑\prime$ and $u\dprime$ should be approximations
to $u$, and $v↑\prime$ and $v\dprime$ should be approximations
to $v$. Theorem A tells us that
$$u \oplus v = u↑\prime + v\dprime = u\dprime + v↑\prime .\eqno(42)$$
This is a stronger statement than the identity
$$u \oplus v = u↑\prime \oplus v\dprime = u\dprime \oplus v↑\prime
,\eqno (43)$$
which follows by rounding (42).
\proofbegin Let us say that $t$ is a
{\sl tail} of $x$ modulo $b↑e$ if
$$t ≡ x\modulo {b↑e},\qquad |t| ≤{\textstyle {1\over 2}}b↑e;\eqno (44)$$
thus, $x-\round(x)$ is always a tail of $x$.
The proof of Theorem A rests largely on the following simple
fact proved in exercise 11:
\thbegin Lemma T. {\sl If $t$ is a tail of the floating-point number
$x$, then $x \ominus t = x - t$.}\quad\blackslug
\yyskip Let $w = u \oplus v$. Theorem A holds trivially
when $w = 0$. By multiplying all variables by a suitable power
of $b$, we may assume without loss of generality that $e↓w =
p$. Then $u + v = w + r$, where $r$ is a tail of $u + v$ modulo
1. Furthermore $u↑\prime =\round(w - v) =\round(u - r) =
u - r - t$, where $t$ is a tail of $u - r$ modulo $b↑e$ and
$e = e↓{u↑\prime} - p$.
If $e ≤ 0$ then $t ≡ u - r ≡ -v \modulo
{b↑e}$, hence $t$ is a tail of $-v$ and $v\dprime = \round(w
- u↑\prime ) = \round(v + t) = v + t$; this proves (40). If
$e > 0$ then $|u - r| ≥ b↑p - {1\over 2}$, and since $|r| ≤
{1\over 2}$ we have $|u| ≥ b↑p - 1$. It follows that $r$ is
a tail of $v$ modulo 1. If $u↑\prime = u$, we have $v\dprime
= \round(w - u) = \round(v - r) = v - r$. Otherwise the relation
$\round(u - r) ≠ u$ implies that $|u| = b↑p - 1$, $|r| = {1\over
2}$, $|u↑\prime | = b↑p$; we have $v\dprime = \round(w - u↑\prime
) = \round(v + r) = v + r$, because $r$ is also a tail of $-v$
in this case.\quad\blackslug
\yyskip Theorem A exhibits a regularity property of floating-point
addition, but it doesn't seem to be an especially useful result.
The following identity is more significant:
\thbegin Theorem B. {\sl Under the hypotheses of Theorem
A and $(41)$,}
$$u + v = (u \oplus v) + \biglp (u \ominus u↑\prime ) \oplus
(v \ominus v\dprime )\bigrp .\eqno (45)$$
\dproofbegin In fact, we can show that $u \ominus
u↑\prime = u - u↑\prime$, $v \ominus v\dprime = v - v\dprime
$, and $(u - u↑\prime ) \oplus (v - v\dprime ) = (u - u↑\prime
) + (v - v\dprime )$, hence (45) will follow from Theorem A\null.
Using the notation of the preceding proof, these relations are
respectively equivalent to
$$\round(t + r) = t + r,\qquad \round(t) = t,\qquad \round(r)
= r.\eqno (46)$$
Exercise 12 establishes the theorem in the special
case $|e↓u - e↓v| ≥ p$. Otherwise $u + v$ has at most $2p$ significant
digits and it is easy to see that $\round(r) = r$. If now $e
> 0$, the proof of Theorem A shows that $t = -r$ or $t = r =
\pm {1\over 2}$. If $e ≤ 0$ we have $t + r ≡ u$ and $t ≡
-v \modulo {b↑e}$; this is enough to prove that $t + r$ and
$r$ round to themselves, provided that $e↓u ≥ e$ and $e↓v ≥
e$. But either $e↓u < 0$ or $e↓v < 0$ would contradict our hypothesis
that $|e↓u - e↓v| < p$, since $e↓w = p$.\quad\blackslug
\yyskip Theorem B gives {\sl an explicit formula for the
difference} between $u + v$ and $u \oplus v$, in terms of quantities
that can be calculated directly using five operations of floating-point
arithmetic. If the radix $b$ is 2 or 3, we can improve on this
result, obtaining the exact value of the correction term with
only two floating-point operations and one (fixed-point) comparison
of absolute values:
\thbegin Theorem C. {\sl If $b ≤ 3$ and $|u| ≥ |v|$, then}
$$u + v = (u \oplus v) + \biglp u \ominus (u \oplus v)\bigrp
\oplus v.\eqno (47)$$
\penalty-400 % desirable page break (based on situation as of August 7, 1978)
\dproofbegin Following the conventions of preceding
proofs again, we wish to show that $v \ominus v↑\prime = r$.
It suffices to show that $v↑\prime = w - u$, because (46) will
then yield $v \ominus v↑\prime = \round(v - v↑\prime ) = \round(u
+ v - w) = \round(r) = r$.
We shall in fact prove (47) whenever $b ≤ 3$ and
$e↓u ≥ e↓v$. If $e↓u ≥ p$ then $r$ is a tail of $v$ modulo 1,
hence $v↑\prime = w \ominus u = v \ominus r = v - r = w - u$
as desired. If $e↓u < p$ then we must have $e↓u = p - 1$, and
$w - u$ is a multiple of $b↑{-1}$; it will therefore round to
itself if its magnitude is less than $b↑{p-1} + b↑{-1}$. Since
$b ≤ 3$, we have indeed $|w - u| ≤ |w - u-v| + |v| ≤ {1\over
2} + (b↑{p-1} - b↑{-1}) < b↑{p-1} + b↑{-1}$.\quad\blackslug
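On a binary computer with correctly rounded operations, identity (47) is the basis of the
following brief C sketch, which returns $u \oplus v$ together with the exact rounding error; it
assumes $|u| ≥ |v|$, radix 2, no exponent overflow or underflow, and no hidden extended precision
in intermediate results, and its names are illustrative only.

/* Given |u| >= |v|, compute s = u (+) v and the exact correction
   err = (u + v) - s, as guaranteed by Theorem C when b = 2.
   Assumes correctly rounded binary arithmetic; sketch only. */
double fast_two_sum(double u, double v, double *err)
{
    double s = u + v;      /* s = u (+) v                           */
    double t = u - s;      /* u (-) (u (+) v)                       */
    *err = t + v;          /* (u (-) (u (+) v)) (+) v  =  u + v - s */
    return s;
}

When the relative magnitudes of $u$ and $v$ are not known in advance, a comparison of $|u|$ and
$|v|$ must precede this step, or the five-operation formula of Theorem B can be used instead.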
\yyskip The proofs of Theorems A\null, B\null, and C do not
rely on the precise definitions of round$(x)$ in the ambiguous
cases when $x$ is exactly midway between consecutive floating-point
numbers; any way of resolving the ambiguity will suffice for
the validity of everything we have proved so far. No rounding
rule can be best for every application. For example, we generally
want a special rule when computing our income tax. But for most
numerical calculations the best policy appears to be the rounding
scheme specified in Algorithm 4.2.1N\null, which insists that the
least significant digit should always be made even (or always
odd) when an ambiguous value is rounded. This is not a trivial
technicality, of interest only to nit-pickers; it is an important
practical consideration, since the ambiguous case arises surprisingly
often and a biased rounding rule produces significantly poor
results. For example, consider decimal arithmetic and assume
that remainders of 5 are always rounded upwards. Then if $u
= 1.0000000$ and $v = 0.55555555$ we have $u \oplus v = 1.5555556$;
and if we floating-subtract $v$ from this result we get $u↑\prime
= 1.0000001$. Adding and subtracting $v$ from $u↑\prime$ gives
1.0000002, and the next time we will get 1.0000003, etc.; the
result keeps growing although we are adding and subtracting
the same value.
This phenomenon, called {\sl drift}, will not occur
when we use a stable rounding rule based on the parity of the
least significant digit. More precisely:
\thbegin Theorem D. $\biglp ((u \oplus v) \ominus v) \oplus
v\bigrp \ominus v = (u \oplus v) \ominus v$.
\yyskip\noindent For example, if $u = 1.2345679$ and $v = -0.23456785$,
we find $u \oplus v = 1.0000000$, $(u \oplus v) \ominus v = 1.2345678$,
$((u \oplus v) \ominus v) \oplus v = 0.99999995$, $\biglp
((u \oplus v) \ominus v) \oplus v\bigrp \ominus v = 1.2345678$.
The proof for general $u$ and $v$ seems to require a case analysis
even more detailed than that in the above theorems; see the
references at the end of this section.\quad\blackslug
%folio 300 galley 11 (C) Addison-Wesley 1978 *
\yyskip Theorem D is valid both for ``round to even'' and ``round
to odd''; how should we choose between these possibilities?
When the radix $b$ is odd, ambiguous cases never arise
except during floating-point division, and the rounding
in such cases is comparatively unimportant. For {\sl even} radices, there
is reason to prefer the following rule: ``Round to even
when $b/2$ is odd, round to odd when $b/2$ is even.'' The least
significant digit of a floating-point fraction occurs frequently
as a remainder to be rounded off in subsequent calculations,
and this rule avoids generating the digit $b/2$ in the least
significant position whenever possible; its effect is to provide
some memory of an ambiguous rounding so that subsequent rounding
will tend to be unambiguous. For example, if we were to round
to odd in the decimal system, repeated rounding of the number
2.44445 to one less place each time leads to the sequence 2.4445,
2.445, 2.45, 2.5, 3; but if we round to even, such situations
do not occur.\xskip [Roy A. Keir, {\sl Inf.\ Proc.\ Letters \bf 3} (1975),
188--189.]\xskip On the other hand, some people prefer rounding to even in
all cases, so that the remainder will tend to be 0 more often. Neither
alternative conclusively dominates the other.
A reader who has checked some of the details of
the above proofs will realize the immense simplification that
has been afforded by the simple rule $u \oplus v =\hbox{round}(u
+ v)$. If our floating-point addition routine would fail to
give this result even in a few rare cases, the proofs would
become enormously more complicated and perhaps they would even
break down completely.
Theorem B fails if truncation arithmetic is used
in place of rounding, i.e., if we let $u \oplus v =\hbox{trunc}(u
+ v)$ and $u \ominus v =\hbox{trunc}(u - v)$, where trunc$(x)$ takes
all positive real $x$ into the largest floating-point number
$≤x$. An exception to Theorem B would then occur for cases such
as $(20, +.10000001) \oplus (10, -.10000001) = (20, +.10000000)$,
when the difference between $u + v$ and $u \oplus v$ cannot
be expressed exactly as a floating-point number; and also for
cases such as $12345678 \oplus .012345678$, when it can be.
Many people feel that, since floating-point arithmetic
is inexact by nature, there is no harm in making it just a little
bit less exact in certain rather rare cases, if it is convenient
to do so. This policy saves a few cents in the design of computer
hardware, or a small percentage of the average running time
of a subroutine. But the above discussion shows that such a
policy is mistaken. We could save about five percent of the
running time of the FADD subroutine, Program 4.2.1A\null, and about
25 percent of its space, if we took the liberty of rounding
incorrectly in a few cases, but we are much better off leaving
it as it is. The reason is not to glorify ``bit chasing''; a more fundamental
issue is at stake here: {\sl Numerical subroutines should deliver
results that satisfy simple, useful mathematical laws whenever
possible.} The crucial formula $u \oplus v =\hbox{round}(u + v)$
is a ``regularity'' property that makes a great deal of difference
between whether mathematical analysis of computational algorithms
is worth doing or worth avoiding. Without any underlying symmetry
properties, the job of proving interesting results becomes extremely
unpleasant. {\sl The enjoyment of one's tools is an essential
ingredient of successful work.}
\subsectionbegin{B. Unnormalized floating-point arithmetic}
The policy of
normalizing all floating-point numbers may be construed in two
ways: We may look on it favorably by saying that it is an attempt
to get the maximum possible accuracy obtainable with a given
degree of precision, or we may consider it to be potentially
dangerous since it tends to imply that the results are more
accurate than they really are. When we normalize the result
of $(1, +.31428571) \ominus (1, +.31415927)$ to $(-2, +.12644000)$,
we are suppressing information about the possibly greater inaccuracy
of the latter quantity. Such information would be retained if
the answer were left as $(1, +.00012644)$.
The input data to a problem is frequently not known
as precisely as the floating-point representation allows. For
example, the values of Avogadro's number and Planck's constant
are not known to eight significant digits, and it might be
more appropriate to denote them, respectively, by
$$(27, +.00060225)\qquad \hbox{and}\qquad (-23, +.00010545)$$
instead of by $(24, +.60225200)$ and $(-26, +.10545000)$.
It would be nice if we could give our input data for each problem
in an unnormalized form that expresses how much precision is
assumed, and if the output would indicate just how much precision
is known in the answer. Unfortunately, this is a terribly difficult
problem, although the use of unnormalized arithmetic can help
to give some indication. For example, we can say with a fair
degree of certainty that the product of Avogadro's number by
Planck's constant is $(0, +.00063507)$, and that their sum is $(27,
+.00060225)$.\xskip (The purpose of this example is not to suggest
that any important physical significance should be attached
to the sum and product of these fundamental constants; the point
is that it is possible to preserve a little of the information
about precision in the result of calculation with imprecise
quantities, when the original operands are independent of each
other.)
The rules for unnormalized arithmetic are simply
this: Let $l↓u$ be the number of leading zeros in the fraction
part of $u = (e↓u, f↓u)$, so that $l↓u$ is the largest integer
$≤p$ with $|f↓u| < b↑{-l↓u}$. Then addition and subtraction
are performed just as in Algorithm 4.2.1A\null, except that all scaling
to the left is suppressed. Multiplication and division are performed
as in Algorithm 4.2.1M, except that the answer is scaled right
or left so that precisely $\max(l↓u, l↓v)$ leading zeros appear.
Essentially the same rules have been used in manual calculation
for many years.
It follows that, for unnormalized computations,
$$\baselineskip16pt\eqalignno{
e↓{u\oplus v},\, e↓{u\ominus v}⊗= \max(e↓u, e↓v) + \hbox{(0 or 1)}⊗(48)\cr
e↓{u\otimes v}⊗= e↓u + e↓v - q - \min(l↓u, l↓v) - \hbox{(0 or 1)}⊗(49)\cr
e↓{u\odiv v}⊗= e↓u - e↓v + q - l↓u + l↓v + \max(l↓u,l↓v) + \hbox{(0 or 1)}.⊗(50)\cr
}$$
When the result of a calculation is zero, an unnormalized
zero (often called an ``order of magnitude zero'') is given
as the answer; this indicates that the answer may not truly
be zero; we just don't know any of its significant digits.
Error analysis takes a somewhat different form
with unnormalized floating-point arithmetic. Let us define
$$\delta ↓u = {\textstyle{1\over 2}}b↑{e↓u-q-p}\qquad\hbox{if $u = (e↓u,
f↓u)$.}\eqno (51)$$
This quantity depends on the representation of
$u$, not just on the value $b↑{e↓u-q}f↓u$. Our rounding
rule tells us that
$$\eqalign{|u\oplus v-(u+v)|⊗≤\delta↓{u\oplus v},\cr\noalign{\vskip4pt}
|u\otimes v-(u\times v)|⊗≤\delta↓{u\otimes v},\cr}\qquad
\eqalign{|u\ominus v-(u-v)|⊗≤\delta↓{u\ominus v},\cr\noalign{\vskip4pt}
|u \odiv v - (u\>\,/\>\,v)|⊗≤ \delta ↓{u\odiv v}.\cr}$$
These inequalities apply to normalized as well
as unnormalized arithmetic; the main difference between the
two types of error analysis is the definition of the exponent
of the result of each operation $\biglp$Eqs.\ (48) to (50)$\bigrp$.
We have remarked that the relations $\prec$, $~$, $\succ$,
and $\approx$ defined earlier in this section are valid and meaningful
for unnormalized numbers as well as for normalized numbers.
As an example of the use of these relations, let us prove an
approximate associative law for unnormalized addition, analogous
to (39):
$$(u \oplus v) \oplus w \approx u \oplus (v \oplus w)\quad
(ε),\eqno (52)$$
for suitable $ε$. We have
$$\eqalign{|(u \oplus v) \oplus w - (u + v + w)| ⊗≤ \left|(u \oplus v)
\oplus w - \biglp (u \oplus v) + w\bigrp\right|\cr⊗\qquad\null + |u
\oplus v - (u + v)|\cr\noalign{\vskip4pt}
⊗≤ \delta ↓{(u \oplus v) \oplus w}+
\delta ↓{u\oplus v}\cr\noalign{\vskip4pt}
⊗≤ 2\delta ↓{(u \oplus v) \oplus w}.\cr}$$
A similar formula holds for $|u \oplus (v \oplus w)
- (u + v + w)|$. Now since $e↓{(u \oplus v) \oplus w} = \max(e↓u,
e↓v, e↓w) + \hbox{(0, 1, or 2)}$, we have $\delta ↓{(u \oplus v) \oplus
w} ≤ b↑2\delta ↓{u \oplus (v \oplus w)}$. Therefore we find
that (52) is valid when $ε ≥ 2b↑{2-p}$; unnormalized addition
is not as erratic as normalized addition with respect to the
associative law.
%folio 306 galley 12 (C) Addison-Wesley 1978 *
It should be emphasized that unnormalized arithmetic
is by no means a panacea. There are examples where it indicates
greater accuracy than is present (e.g., addition of a great many
small quantities of about the same magnitude, or evaluation
of $x↑n$ for large $n$); and there are many more
examples when it indicates poor accuracy while normalized arithmetic
actually does produce good results. There is an important reason
why no straightforward one-operation-at-a-time method of error
analysis can be completely satisfactory, namely the fact that operands are
usually not independent of each other. This means that errors
tend to cancel or reinforce each other in strange ways. For
example, suppose that $x$ is approximately ${1\over 2}$, and
suppose that we have an approximation $y = x + \delta$ with
absolute error $\delta $. If we now wish to compute $x(1 - x)$,
we can form $y(1 - y)$; if $x = {1\over 2} + ε$ we find $y(1
- y) = x(1 - x) - 2ε\delta - \delta ↑2$, so the error has decreased
by a factor of $2ε + \delta $. This is just one case where multiplication
of imprecise quantities can lead to a quite accurate result
when the operands are not independent of each other. A more
obvious example is the computation of $x \ominus x$, which can
be obtained with perfect accuracy regardless of how bad an approximation
to $x$ we begin with.
The extra information that unnormalized arithmetic
gives us can often be more important than the information it
destroys during an extended calculation, but (as usual) we must
use it with care. Examples of the proper use of unnormalized
arithmetic are discussed by R. L. Ashenhurst and N. Metropolis
in {\sl Computers and Computing, AMM} Slaught Memorial Papers
{\bf 10} (February, 1965), 47--59; by N. Metropolis in {\sl Numer.\
Math.\ \bf7} (1965), 104--112; and by R. L. Ashenhurst
in {\sl Error in Digital Computation \bf 2}, ed.\ by L. B.
Rall (New York: Wiley, 1965), 3--37. Appropriate methods for
computing standard mathematical functions with both input and
output in unnormalized form are given by R. L. Ashenhurst in
{\sl JACM \bf 11} (1964), 168--187. An extension of unnormalized arithmetic,
which remembers that certain values are known to be {\sl exact},
has been discussed by N. Metropolis in {\sl IEEE Trans.\ \bf C--22}
(1973), 573--576.
\def\upper#1{\mathop{\hbox to 8.889pt{\:u\char'156\hskip-8.889pt\hfill
$\vcenter{\hbox{$\scriptscriptstyle#1$}}$\hfill}}}
\def\lwr#1{\mathop{\hbox to 8.889pt{\:u\char'157\hskip-8.889pt\hfill
$\vcenter{\hbox{$\scriptscriptstyle#1$}}$\hfill}}}
\def\\{\raise 3.444pt\hbox{$\scriptscriptstyle\!\!\not\,\,$}} % \\h will be h bar
\def\¬#1{\vcenter{\hbox{\:@\char'#1}}}
\subsectionbegin{C. Interval arithmetic} Another approach
to the problem of error determination is the so-called ``interval''
or ``range'' arithmetic, in which upper and lower bounds on
each number are maintained during the calculations. Thus, for
example, if we know that $u↓0 ≤ u ≤ u↓1$ and $v↓0 ≤ v ≤ v↓1$,
we represent this by the interval notation $u = [u↓0, u↓1]$, $v
= [v↓0, v↓1]$. The sum $u \oplus v$ is $[u↓0 \lwr+ v↓0, u↓1
\upper+ v↓1]$, where $\lwr+$ denotes ``lower floating-point addition,''
the greatest representable number less than or equal to the
true sum, and $\upper+$ is defined similarly (see exercise 4.2.1--13).
Furthermore $u \ominus v = [u↓0 \lwr- v↓1, u↓1 \upper- v↓0]$;
and if $u↓0$ and $v↓0$ are positive, we have $u \otimes v =
[u↓0 \lwr\times v↓0, u↓1 \upper\times v↓1]$, $u \odiv v = [u↓0 \lwr/ v↓1,
u↓1 \upper/ v↓0]$. For example, we might represent Avogadro's
number and Planck's constant as
$$\eqalign{N⊗=\¬2 (24, +.60222400),\, (24, +.60228000)\¬3,\cr\noalign{\vskip 3pt}
\\h⊗=\¬2 (-26, +.10544300),\, (-26,+.10545700)\¬3;\cr}$$
their sum and product would then turn out to be
$$\eqalign{N \oplus \\h⊗ = \¬2 (24, +.60222400),\,(24, +.60228001)\¬3,\cr
\noalign{\vskip 3pt}
N \otimes \\h⊗= \¬2 (-3, +.63500305),\,(-3, +.63514642)\¬3.\cr}$$
If we try to divide by $[v↓0, v↓1]$
when $v↓0 < 0 < v↓1$, there is a possibility of division by
zero. Since the philosophy underlying interval arithmetic is
to provide rigorous error estimates, a divide-by-zero error
should be signalled in this case. However, overflow and underflow
need not be treated as errors in interval arithmetic, if special
conventions are introduced as discussed in exercise 24.
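As a hedged illustration for binary hardware, the lower and upper operations can be simulated, a
little pessimistically, by widening each correctly rounded result one unit in the last place with
the standard C function nextafter, thereby avoiding any reliance on directed rounding modes; the
data type and function names below are illustrative only.

#include <math.h>

typedef struct { double lo, hi; } interval;      /* [u_0, u_1] */

/* Interval sum: round the left endpoint down and the right endpoint up.
   Widening by one ulp is a safe, slightly pessimistic substitute for
   true lower and upper addition.  Sketch only. */
interval int_add(interval u, interval v)
{
    interval w;
    w.lo = nextafter(u.lo + v.lo, -HUGE_VAL);
    w.hi = nextafter(u.hi + v.hi,  HUGE_VAL);
    return w;
}

/* Product of two intervals whose endpoints are positive, as in the text. */
interval int_mul_pos(interval u, interval v)
{
    interval w;
    w.lo = nextafter(u.lo * v.lo, -HUGE_VAL);
    w.hi = nextafter(u.hi * v.hi,  HUGE_VAL);
    return w;
}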
Interval arithmetic takes only about twice as long
as ordinary arithmetic, and it provides truly reliable error
estimates. Considering the difficulty of mathematical error
analyses, this is indeed a small price to pay. Since the intermediate
values in a calculation often depend on each other, as explained
above, the final estimates obtained with interval arithmetic
will tend to be pessimistic; and iterative numerical methods
often have to be redesigned if we want to deal with intervals.
The prospects for effective use of interval arithmetic look
very good, however, and efforts should be made to increase its
availability.
\subsectionbegin{D. History and bibliography} Jules
Tannery's classic treatise on decimal calculations, {\sl Le\c cons
d'Arithm\'etique} (Paris: Colin, 1894), stated that
positive numbers should be rounded upwards if the first discarded
digit is 5 or more; since exactly half of the decimal digits
are 5 or more, he felt that this rule would round upwards exactly half of the
time, on the average, so it would produce
compensating errors. The idea of ``round to
even'' in the ambiguous cases seems to have been mentioned first
by James B. Scarborough in the first edition of his pioneering
book {\sl Numerical Mathematical Analysis} (Baltimore: Johns
Hopkins Press, 1930), p.\ 2; in the second (1950) edition he
amplified his earlier remarks, stating that ``It should be obvious
to any thinking person that when a 5 is cut off, the preceding
digit should be increased by 1 in only {\sl half} the cases,''
and he recommended round-to-even in order to achieve this.
The first analysis of floating-point arithmetic
was given by F. L. Bauer and K. Samelson, {\sl Zeitschrift f\"ur
angewandte Math.\ und Physik \bf 5} (1953), 312--316. The
next publication was not until over five years later: J. W.
Carr III, {\sl CACM \bf 2}, 5 (May 1959), 10--15. See also
P. C. Fischer, {\sl Proc.\ ACM Nat.\ Meeting \bf 13} (Urbana,
Illinois, 1958), paper 39. The book {\sl Rounding Errors in
Algebraic Processes} (Englewood Cliffs: Prentice-Hall, 1963),
by J. H. Wilkinson, shows how to apply error analysis of the
individual arithmetic operations to the error analysis of large-scale
problems; see also his treatise on {\sl The Algebraic Eigenvalue
Problem} (Oxford: Clarendon Press, 1965).
The relations $\prec$, $~$, $\succ$, $\approx$ introduced in
this section are similar to ideas published by A. van Wijngaarden
in {\sl BIT \bf 6} (1966), 66--81. Theorems A and B above were
inspired by some related work of Ole M\o ller, {\sl BIT
\bf 5} (1965), 37--50, 251--255; Theorem C is due to T. J. Dekker,
{\sl Numer.\ Math.\ \bf 18} (1971), 224--242. Extensions and refinements
of all three theorems have been published by S. Linnainmaa,
{\sl BIT \bf 14} (1974), 167--202. W. M. Kahan introduced
Theorem D in some unpublished notes; for a complete proof and
further commentary, see J. F. Reiser and D. E. Knuth, {\sl Inf.\
Proc.\ Letters \bf 3} (1975), 84--87, 164.
Unnormalized floating-point arithmetic was recommended
by F. L. Bauer and K. Samelson in the article cited above, and
it was independently used by J. W. Carr III at the University
of Michigan in 1953. Several years later, the MANIAC III computer
was designed to include both kinds of arithmetic in its hardware;
see R. L. Ashenhurst and N. Metropolis, {\sl JACM \bf 6} (1959),
415--428, {\sl IEEE Trans.\ \bf EC--12}
(1963), 896--901; R. L. Ashenhurst, {\sl Proc.\ Spring Joint
Computer Conf.\ \bf 21} (1962), 195--202. See also H. L. Gray
and C. Harrison, Jr., {\sl Proc.\ Eastern Joint Computer Conf.\
\bf 16} (1959), 244--248, and W. G. Wadey, {\sl JACM \bf 7}
(1960), 129--139, for further early discussions of unnormalized
arithmetic.
For early developments in interval arithmetic,
and some modifications, see A. Gibb, {\sl CACM \bf 4} (1961),
319--320; B. A. Chartres, {\sl JACM \bf 13} (1966), 386--403;
and the book {\sl Interval Analysis} by Ramon E. Moore (Prentice-Hall,
1966).
More recent work on floating-point accuracy is
summarized in two important papers that can be especially recommended
for further study: W. M. Kahan, {\sl Proc.\ IFIP Congress} (1971),
{\bf 2}, 1214--1239; R. P. Brent, {\sl IEEE Trans.\ \bf C--22}
(1973), 601--607. Both papers include useful theory and demonstrate
that it pays off in practice.
The book {\sl Grundlagen des Numerischen Rechnens---Mathematische Be\-gr\"und\-ung
der Rechnerarithmetik} by Ulrich Kulisch (Mannheim: Bibl.\ Inst.,
1976) is entirely devoted to
the study of floating-point arithmetic systems; see also Kulisch's article in
{\sl IEEE Trans.\ \bf C--26} (1977), 610--621.
\exbegin{EXERCISES}
\vskip 3pt plus 3pt minus 1pt
\noindent {\sl Note: }Normalized floating-point arithmetic
is assumed unless the contrary is specified.
\exno 1. [M18] Prove that identity (7) is a consequence of (2)
through (6).
\exno 2. [M20] Use identities (2) through (8) to prove that
$(u \oplus x) \oplus (v \oplus y) ≥ u \oplus v$ whenever $x
≥ 0$ and $y ≥ 0$.
\exno 3. [M20] Find eight-digit floating-decimal numbers $u$,
$v$, and $w$ such that
$$u \otimes (v \otimes w) ≠ (u \otimes v) \otimes w,$$
and such that no exponent overflow or underflow occurs during
the computations.
\exno 4. [10] Is it possible to have floating-point numbers
$u$, $v$, and $w$ for which exponent overflow occurs during the
calculation of $u \otimes (v \otimes w)$ but not during the
calculation of $(u \otimes v) \otimes w$?
\exno 5. [M20] Is $u\odiv v=u\otimes(1\odiv v)$ an identity, for
all floating-point numbers $u$ and $v≠0$ such that no exponent
overflow or underflow occurs?
\exno 6. [M22] Is either of the following two identities valid
for all floating-point numbers $u$?\xskip (a) $0 \ominus (0 \ominus
u) = u$.\xskip (b) $1 \odiv (1 \odiv u) = u$.
\def\expcirc#1{{\hbox to 9pt{\hfill
\hbox to 0pt{\hskip 0pt minus 100pt\raise 5.0833pt
\hbox{\:@\char'142}\hskip 0pt minus 100pt}\!
\hbox to 0pt{\hskip 0pt minus 100pt\:e#1\hskip 0pt minus 100pt}\hfill}}}
\exno 7. [M21] Let $u↑\expcirc2$ stand for $u \otimes
u$. Find floating-binary numbers $u$ and $v$ such that $2(u↑\expcirc2+v↑\expcirc2)
<(u\oplus v)↑\expcirc2$.
\trexno 8. [20] Let $ε = 0.0001$; which of the relations $$u \prec
v\quad(ε),\qquad u ~ v\quad(ε),\qquad u \succ v\quad(ε),\qquad u\approx v\quad(ε)$$
hold for the following pairs
of base 10, excess 0, eight-digit floating-point numbers?
\yskip\textindent{a)}$u = ( 1, +.31415927)$, $v = ( 1, +.31416000)$;
\textindent{b)}$u = ( 0, +.99997000)$, $v = ( 1, +.10000039)$;
\textindent{c)}$u = (24, +.60225200)$, $v = (27, +.00060225)$;
\textindent{d)}$u = (24, +.60225200)$, $v = (31, +.00000006)$;
\textindent{e)}$u = (24, +.60225200)$, $v = (32, +.00000000)$.
\exno 9. [M22] Prove (33), and explain why the conclusion cannot
be strengthened to the relation $u \approx w\;(ε↓1 + ε↓2)$.
%folio 313 galley 13 (C) Addison-Wesley 1978 *
\trexno 10. [M25] (W. M. Kahan.)\xskip A certain computer
performs floating-point arithmetic without proper rounding,
and, in fact, its floating-point multiplication routine ignores
all but the first $p$ most significant digits of the $2p$-digit
product $f↓uf↓v$.\xskip (Thus when $f↓uf↓v < 1/b$, the least-significant
digit of $u \otimes v$ always comes out to be zero, due to subsequent
normalization.)\xskip Show that this causes the monotonicity of multiplication
to fail; i.e., there are positive normalized floating-point
numbers $u$, $v$, $w$ such that $u < v$ but $u \otimes w > v \otimes
w$.
\exno 11. [M20] Prove Lemma T.
\exno 12. [M24] Carry out the proof of Theorem B and (46) when
$|e↓u - e↓v| ≥ p$.
\def\omod{\hbox to 25pt{\hfill$\vcenter{\hbox{\:@\char'140}}$\hfill}\hskip-25pt
\raise .3pt\hbox to 25pt{\hfill$\vcenter{\moveleft .2pt\hbox{\:emod}}$\hfill}}
\trexno 13. [M25] Some programming languages (and even some computers)
make use of floating-point arithmetic only, with no provision
for exact calculations with integers. If operations on integers
are desired, we can, of course, represent an integer as a floating-point
number; and when the floating-point operations satisfy the basic
definitions in (9), we know that all floating-point operations
will be exact, provided that the operands and the answer can each be
represented exactly with $p$ significant digits. Therefore---so long as we
know that the numbers aren't too large---we can add, subtract, or multiply
integers with no inaccuracy due to rounding errors.
But suppose that a programmer wants to determine if $m$ is an exact multiple
of $n$, when $m$ and $n≠0$ are integers. Suppose further that a subroutine is
available to calculate
the quantity round$(u\mod 1) = u\omod 1$ for any given floating-point
number $u$, as in exercise 4.2.1--15. One good way to determine whether or not
$m$ is a multiple of $n$ would perhaps be to test whether or not $(m\odiv
n)\omod1=0$, using the assumed subroutine; but perhaps rounding errors
in the floating-point calculations will invalidate this test.
Find suitable conditions on the range of integer values $n≠0$ and $m$, such that
$m$ is a multiple of $n$ if and only if $(m\odiv n)\omod1=0$. In
other words, show that if $m$ and $n$ are not too large, this test is valid.
\exno 14. [M27] Find a suitable $ε$ such that $(u\otimes v)\otimes w\approx
u\otimes(v\otimes w)\;(ε)$, when {\sl unnormalized} multiplication is being
used.\xskip
$\biglp$This generalizes (39), since unnormalized multiplication is exactly
the same as normalized multiplication when the input operands $u$, $v$, and
$w$ are normalized.$\bigrp$
\trexno 15. [M24] (H. Bj\"orck.)\xskip Does the computed midpoint of an interval
always lie between the endpoints?\xskip (In other words, does $u≤v$ imply that
$u≤(u\oplus v)\odiv2≤v$?)
\exno 16. [M28] (a) What is $\biglp\,\cdotss((x↓1\oplus x↓2)\oplus x↓3)\oplus\cdots
\oplus x↓n\bigrp$ when $n=10↑6$ and $x↓k=1.1111111$ for all $k$, using eight-digit
floating-decimal arithmetic?\xskip(b) What happens when Eq.\ (14) is used to
calculate the standard deviation of these particular
values $x↓k$? What happens when Eqs.\ (15) and (16)
are used instead?\xskip(c) Prove that $S↓k≥0$ in (16), for all choices of $x↓1$,
$\ldotss$, $x↓k$.
\exno 17. [28] Write a \MIX\ subroutine, \.{FCMP}, which compares the floating-point
number $u$ in location \.{ACC} with the floating-point number $v$
in register A, and which sets the comparison indicator to \.{LESS},
\.{EQUAL}, or \.{GREATER}, according as $u \prec v$, $u ~ v$, or $u \succ v\;
(ε)$; here $ε$ is stored in location \.{EPSILON} as a nonnegative
fixed-point quantity with the decimal point assumed at the left
of the word. Assume normalized inputs.
\exno 18. [M40] In unnormalized arithmetic is there a suitable
number $ε$ such that
$$u \otimes (v \oplus w) \approx (u \otimes v) \oplus (u \otimes
w)\quad (ε)?$$
\trexno 19. [M30] (W. M. Kahan.)\xskip Consider
the following procedure for floating-point summation of $x↓1,
\ldotss , x↓n$:
$$\eqalign{s↓0⊗ = c↓0 = 0;\cr y↓k⊗= x↓k \ominus c↓{k-1},\quad s↓k = s↓{k-1}
\oplus y↓k,\quad c↓k = (s↓k \ominus s↓{k-1}) \ominus y↓k,\qquad
\hbox{for $1 ≤ k ≤ n$.}\cr}$$
Let the relative errors in these operations be
defined by the equations
$$\vbox{\halign{$\hfill#\hfill$\cr
y↓k = (x↓k - c↓{k-1})(1 + \eta ↓k),\qquad s↓k = (s↓{k-1}
+ y↓k)(1 + \sigma ↓k),\cr\noalign{\vskip3pt}c↓k = \biglp(s↓k - s↓{k-1})(1 + \gamma
↓k) - y↓k\bigrp(1 + \delta ↓k),\cr}}$$
where $|\eta ↓k|, |\sigma
↓k|,|\gamma↓k|,|\delta↓k| ≤ ε$. Prove that $s↓n = \sum ↓{1≤k≤n} (1 + \theta ↓k)x↓k$,
where $|\theta ↓k| ≤ 2ε + O(nε↑2)$.\xskip$\biglp$Theorem C says that if
$b = 2$ and $|s↓{k-1}| ≥ |y↓k|$ we have $s↓{k-1} + y↓k = s↓k
- c↓k$ exactly. But in this exercise we want to obtain an estimate
that is valid {\sl even when floating-point operations are
not carefully rounded,} assuming only that each operation has
bounded relative error.$\bigrp$
\exno 20. [25] (S. Linnainmaa.)\xskip Find all $u, v$ for which $|u|
≥ |v|$ and (47) fails.
\exno 21. [M35] (T. J. Dekker.)\xskip Theorem C shows how to do exact
addition of floating-binary numbers. Explain how to do {\sl
exact multiplication\/}: Express the product $uv$ in the form
$w + w↑\prime $, where $w$ and $w↑\prime$ are computed from
two given floating-binary numbers $u$ and $v$, using only the
operations $\oplus$, $\ominus$, and $\otimes$.
\exno 22. [M30] Can drift occur in floating-point multiplication/division?
Consider the sequence $x↓0 = u$, $x↓{2n+1} = x↓{2n} \otimes v$,
$x↓{2n+2} = x↓{2n+1} \odiv v$, given $u$ and $v$; what is the
largest $k$ such that $x↓k ≠ x↓{k+2}$ is possible?
\trexno 23. [M26] Prove or disprove: $u \ominus (u\omod 1) = \lfloor
u\rfloor$, for all floating-point $u$.
\exno 24. [M27] Consider the set of all intervals $[u↓l,u↓r]$,
where $u↓l$ and $u↓r$ are either
nonzero floating-point numbers or the special symbols $+0$, $-0$,
$+∞$, $-∞$; each interval must have $u↓l≤u↓r$, and $u↓l=u↓r$ is allowed only when
$u↓l$ is finite and nonzero. The interval $[u↓l,u↓r]$ stands for all floating-point
$x$ such that $u↓l ≤ x ≤ u↓r$, where we regard $-∞ < -x < -0 < 0 <
+0 < +x < +∞$ for all positive $x$.\xskip $\biglp$Thus, $[1, 2]$ means $1 ≤
x ≤ 2$; $[+0, 1]$ means $0 < x ≤ 1$; $[-0, 1]$ means $0 ≤ x ≤ 1$;
$[-0, +0]$ denotes the single value 0; and $[-∞, +∞]$ stands for
everything.$\bigrp$\xskip Show how to define appropriate arithmetic operations on
all such intervals, without resorting to ``overflow'' or ``underflow'' or other
anomalous indications except when dividing by an interval that includes zero.
\trexno 25. [15] When people speak about inaccuracy in floating-point
arithmetic they often ascribe errors to ``cancellation'' that
occurs during the subtraction of nearly equal quantities. But
when $u$ and $v$ are approximately equal, the difference $u
\ominus v$ is obtained exactly, with no error. What do these
people really mean?
\def\\{\hbox{\hskip4pt$\A{\vbox{\moveleft 4pt\hbox{\it f}}}$}}
\exno 26. [HM30] (H. G. Diamond.)\xskip Suppose $f(x)$ is a strictly increasing
function on some interval $[x↓0,x↓1]$, and let $g(x)$ be the inverse function.\xskip
(For example, $f$ and $g$ might be ``exp'' and ``ln'', or ``tan'' and
``arctan''.)\xskip
If $x$ is a floating-point number such that $x↓0≤x≤x↓1$, let $\\(x)=\hbox{round}
(f(x))$, and if $y$ is a floating-point number such that $f(x↓0)≤y≤f(x↓1)$, let
$\A g(y)=\hbox{round}(g(y))$; furthermore, let $h(x)=\A g(\\(x))$, whenever
this is defined. Although $h(x)$ won't always be equal to $x$, due to rounding,
we expect $h(x)$ to be ``near'' $x$.
Prove that if the precision $b↑p$ is at least 3, and if $f$ is strictly
concave or strictly convex (i.e., $f↑{\prime\prime}(x)<0$ or $f↑{\prime\prime}(x)
>0$ for all $x$ in $[x↓0,x↓1]$), then repeated application of $h$ will be
{\sl stable} in the sense that $$h(h(h(x)))=h(h(x)),$$ for all $x$ such that
both sides of this equation are defined. In other words, there will be no
``drift'' if the subroutines are properly implemented.
\trexno 27. [M25] (W. M. Kahan.)\xskip Let $f(x) = 1 + x + x↑2 + \cdots
+ x↑{106} = (1 - x↑{107})/(1 - x)$ for $x < 1$, and let $g(y)
= f\biglp ({1\over 3} - y↑2)(3 + 3.45y↑2)\bigrp$
for $0 < y < 1$. Evaluate $g(y)$ on one or more ``pocket calculators,''
for $y = 10↑{-3}$, $10↑{-4}$, $10↑{-5}$, $10↑{-6}$, and explain all
inaccuracies in the results you obtain.\xskip$\biglp$Since
most present-day calculators
do not round correctly, the results are often surprising. Note
that $g(ε) = 107 - 10491.35ε↑2 + O(ε↑4)$.$\bigrp$
%folio 317 galley 14 (C) Addison-Wesley 1978 *
\runningrighthead{DOUBLE-PRECISION CALCULATIONS}
\section{4.2.3}
\sectionskip
\sectionbegin{\star 4.2.3. Double-Precision Calculations}
Up to now we have considered ``single-precision''
floating-point arithmetic, which essentially means that the
floating-point values we have dealt with can be stored in a
single machine word. When single-precision floating-point arithmetic
does not yield sufficient accuracy for a given application,
the precision can be increased by suitable programming techniques
that use two or more words of memory to represent each number.
Although we shall discuss the general question of high-precision
calculations in Section 4.3, it is appropriate to give a separate
discussion of double-precision here. Special techniques apply
to double precision that are comparatively inappropriate for
higher precisions; and double precision is a reasonably important
topic in its own right, since it is the first step beyond single
precision and it is applicable to many problems that do not require
extremely high precision.
Double-precision calculations are almost always
required for floating-point rather than fixed-point arithmetic,
except perhaps in statistical work where fixed-point double-precision
is commonly used to calculate sums of squares and
cross products; since fixed-point versions of double-precision
arithmetic are simpler than floating-point versions, we shall
confine our discussion here to the latter.
Double precision is quite frequently desired not
only to extend the precision of the fraction parts of floating-point
numbers, but also to increase the range of the exponent part.
Thus we shall deal in this section with the following two-word
format for double-precision floating-point numbers in the \MIX\
computer:
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\vcenter{\hbox to 124pt{\hfill\vbox{\baselineskip0pt\lineskip0pt
\hrule
\hbox{\\ \hbox to 20pt{\hfill$\pm$\hfill\\}\!
\hbox to 20pt{\hfill$e$\hfill\\}\!
\hbox to 20pt{\hfill$e$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}}
\hrule}\hfill}}\qquad
\vcenter{\hbox to 124pt{\hfill\vbox{\baselineskip0pt\lineskip0pt
\hrule
\hbox{\\ \hbox to 20pt{\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}\!
\hbox to 20pt{\hfill$f$\hfill\\}}
\hrule}\hfill}}.\eqno(1)$$
Here two bytes are used for the exponent and eight
bytes are used for the fraction. The exponent is ``excess $b↑2/2$,''
where $b$ is the byte size. The sign will appear in the most
significant word; it is convenient to ignore the sign of the
other word completely.
Our discussion of double-precision arithmetic will
be rather machine-oriented, because it is only by studying the
problems involved in coding these routines that a person can
properly appreciate the subject. A careful study of the \MIX\
programs below is therefore essential to the understanding of
the material.
In this section we shall depart from the idealistic
goals of accuracy stated in the previous two sections; our double-precision
routines will {\sl not} round their results, and a little bit
of error will sometimes be allowed to creep in. Users dare not
trust these routines too much. There was ample reason to squeeze
out every possible drop of accuracy in the single-precision
case, but now we face a different situation:\xskip (a) The extra programming
required to ensure true double-precision rounding in all cases
is considerable; fully accurate routines would take, say, twice
as much space and half again as much time. It was comparatively
easy to make our single-precision routines perfect, but double
precision brings us face to face with our machine's limitations.
$\biglp$A similar situation occurs with respect to other floating-point
subroutines; we can't expect the cosine routine to compute round$(\cos
x)$ exactly for all $x$, since that turns out to be virtually
impossible. Instead, the cosine routine should provide the best
relative error it can achieve with reasonable speed, for all
reasonable values of $x$; and the designer of the routine should
try to make the computed function satisfy simple mathematical
laws whenever possible (e.g., $\cos (-x) = \cos x$, $|\cos x|
≤ 1$, $\cos x ≥ \cos y$ for $0 ≤ x ≤ y < π$).$\bigrp$\xskip (b) Single-precision
arithmetic is a ``staple food'' that everybody who wants to
employ floating-point arithmetic must use, but double precision
is usually for situations where such clean results aren't as
important. The difference between seven- and eight-place accuracy
can be noticeable, but we rarely care about the difference between
15- and 16-place accuracy. Double precision is most often used
for intermediate steps during the calculation of single-precision
results; its full potential isn't needed.\xskip (c) It will be instructive
for us to analyze these procedures in order to see how inaccurate
they can be, since they typify the types of short cuts generally
taken in bad single-precision routines (see exercises 7 and
8).
Let us now consider addition and subtraction operations
from this standpoint. Subtraction is, of course, converted to addition
by changing the sign of the second operand. Addition
is performed by separately adding together the least-significant
halves and the most-significant halves, propagating ``carries''
appropriately. A difficulty arises, however,
since we are doing signed-magnitude arithmetic:
it is possible to add the least-significant halves and to get
the wrong sign (namely, when the signs of the operands are opposite
and the least-significant half of the smaller operand is bigger
than the least-significant half of the larger operand). The simplest solution
is to anticipate the correct sign; so in step A2
(cf.\ Algorithm 4.2.1A), we not only assume that $e↓u ≥ e↓v$,
we also assume that $|u| ≥ |v|$. This means we can be sure that
the final sign will be the sign of $u$. In other respects,
double-precision addition is very much like its single-precision counterpart,
only everything is done twice.
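Stripped of signs, exponents, and normalization, the heart of the method is simply the
propagation of a carry from the low halves into the high halves. In C, for unsigned 64-bit halves
(where the sign-magnitude difficulty discussed above does not arise), the core step might be
sketched as follows, with illustrative names.

#include <stdint.h>

typedef struct { uint64_t hi, lo; } dword;   /* unsigned double word */

/* Add two unsigned double words, propagating the carry out of the
   low halves -- the skeleton of Program A below, without the sign,
   exponent, and normalization handling that MIX requires.  Sketch only. */
dword dw_add(dword u, dword v)
{
    dword w;
    w.lo = u.lo + v.lo;                    /* add least-significant halves */
    w.hi = u.hi + v.hi + (w.lo < u.lo);    /* carry if the low sum wrapped */
    return w;
}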
\algbegin Program A (Double-precision addition).
The subroutine \.{DFADD} adds a double-precision floating-point
number $v$, having the form (1), to a double-precision floating-point
number $u$, assuming that $v$ is initially in rAX (i.e., registers
A and X), and that $u$ is initially stored in locations \.{ACC}
and \.{ACCX}. The answer appears both in rAX and in (\.{ACC}, \.{ACCX}).
The subroutine \.{DFSUB} subtracts $v$ from $u$ under the same conventions.
Both input operands are assumed to be normalized, and the answer
is normalized. The last portion of this program is a double-precision
normalization procedure that is used by other subroutines of
this section. Exercise 5 shows how to improve this program significantly.
{\yyskip\tabskip 10pt\mixfour{\!
01⊗ABS⊗EQU⊗1:5⊗Field definition for absolute value\cr
02⊗SIGN⊗EQU⊗0:0⊗Field definition for sign\cr
03⊗EXPD⊗EQU⊗1:2⊗Double-precision exponent field\cr
\\04⊗DFSUB⊗STA⊗TEMP⊗Double-precision subtraction:\cr
05⊗⊗LDAN⊗TEMP⊗Change sign of $v$.\cr
\\06⊗DFADD⊗STJ⊗EXITDF⊗Double-precision addition:\cr
07⊗⊗CMPA⊗ACC(ABS)⊗Compare $|u|$ with $|v|$.\cr
08⊗⊗JG⊗1F\cr
09⊗⊗JL⊗2F\cr
10⊗⊗CMPX⊗ACCX(ABS)\cr
11⊗⊗JLE⊗2F\cr
\\12⊗1H⊗STA⊗ARG⊗If $|u| < |v|$, interchange $u ↔ v$.\cr
13⊗⊗STX⊗ARGX\cr
14⊗⊗LDA⊗ACC\cr
15⊗⊗LDX⊗ACCX\cr
16⊗⊗ENT1⊗ACC⊗(\.{ACC} and \.{ACCX} are in consecutive\cr
17⊗⊗MOVE⊗ARG(2)⊗\qquad locations.)\cr
\\18⊗2H⊗STA⊗TEMP⊗Now \.{ACC} has the sign of the answer.\cr
19⊗⊗LD1N⊗TEMP(EXPD)⊗$\rI1 ← -e↓v$.\cr
20⊗⊗LD2⊗ACC(EXPD)⊗$\rI2 ← e↓u$.\cr
21⊗⊗INC1⊗0,2⊗$\rI1 ← e↓u - e↓v$.\cr
\\22⊗⊗SLAX⊗2⊗Remove exponent.\cr
23⊗⊗SRAX⊗1,1⊗Scale right.\cr
\\24⊗⊗STA⊗ARG⊗$0\;v↓1\;v↓2\;v↓3\;v↓4$\cr
25⊗⊗STX⊗ARGX⊗$\qquad\qquad\quad v↓5\;v↓6\;v↓7\;v↓8\;v↓9$\cr
26⊗⊗STA⊗ARGX(SIGN)⊗Store true sign in both halves.\cr
\\27⊗⊗LDA⊗ACC\cr
28⊗⊗LDX⊗ACCX\cr
29⊗⊗SLAX⊗2⊗Remove exponent.\cr
30⊗⊗STA⊗ACC⊗$u↓1\;u↓2\;u↓3\;u↓4\;u↓5$\cr
\\31⊗⊗SLAX⊗4\cr
32⊗⊗ENTX⊗1\cr
33⊗⊗STX⊗EXPO⊗$\.{EXPO} ← 1$ (see below).\cr
34⊗⊗SRC⊗1⊗$1\;u↓5\;u↓6\;u↓7\;u↓8$\cr
35⊗⊗STA⊗1F(SIGN)⊗A trick, see comments in text.\cr
\\36⊗⊗ADD⊗ARGX(0:4)⊗Add $0\;v↓5\;v↓6\;v↓7\;v↓8$.\cr
37⊗⊗SRAX⊗4\cr
38⊗1H⊗DECA⊗1⊗Recover from inserted 1.\xskip (Sign varies)\cr
39⊗⊗ADD⊗ACC(0:4)⊗Add most significant halves.\cr
40⊗⊗ADD⊗ARG⊗(Overflow cannot occur)\cr
\noalign{\yskip}
41⊗DNORM⊗JANZ⊗1F⊗Normalization routine:\cr
42⊗⊗JXNZ⊗1F⊗$f↓w$ in rAX, $e↓w = \.{EXPO} + \rI2$.\cr
43⊗DZERO⊗STA⊗ACC⊗If $f↓w = 0$, set $e↓w ← 0$.\cr
44⊗⊗JMP⊗9F\cr
\\45⊗2H⊗SLAX⊗1⊗Normalize to left.\cr
46⊗⊗DEC2⊗1\cr
47⊗1H⊗CMPA⊗=0=(1:1)⊗Is the leading byte zero?\cr
48⊗⊗JE⊗2B\cr
\\49⊗⊗SRAX⊗2⊗(Rounding omitted)\cr
50⊗⊗STA⊗ACC\cr
\\51⊗⊗LDA⊗EXPO⊗Compute final exponent.\cr
52⊗⊗INCA⊗0,2\cr
53⊗⊗JAN⊗EXPUND⊗Is it negative?\cr
54⊗⊗STA⊗ACC(EXPD)\cr
\\55⊗⊗CMPA⊗=1(3:3)=⊗Is it more than two bytes?\cr
56⊗⊗JL⊗8F\cr
\\57⊗EXPOVD⊗HLT⊗20\cr
58⊗EXPUND⊗HLT⊗10\cr
\\59⊗8H⊗LDA⊗ACC⊗Bring answer into rA.\cr
60⊗9H⊗STX⊗ACCX\cr
61⊗EXITDF⊗JMP⊗*⊗Exit from subroutine.\cr
\\62⊗ARG⊗CON⊗0\cr
63⊗ARGX⊗CON⊗0\cr
\\64⊗ACC⊗CON⊗0⊗Floating-point accumulator\cr
65⊗ACCX⊗CON⊗0\cr
66⊗EXPO⊗CON⊗0⊗Part of ``raw exponent''\quad\blackslug\cr}}
%folio 322 galley 15-16 WARNING: Much tape unreadable. (C) Addison-Wesley 1978 *
\yyskip When the least-significant halves are added together
in this program, an extra digit ``1'' is inserted at the left
of the word that is known to have the correct sign. After the
addition, this byte can be 0, 1, or 2, depending on the circumstances,
and all three cases are handled simultaneously in this way.\xskip
(Compare this with the rather cumbersome method of complementation
that is used in Program 4.2.1A.)
It is worth noting that register A can be zero
after the instruction on line 40 has been performed; and, because
of the way \MIX\ defines the sign of a zero result, the accumulator
contains the correct sign that is to be attached to the result
if register X is nonzero. If lines 39 and 40 were interchanged,
the program would be incorrect, even though both instructions are
``\.{ADD}''!
\yskip Now let us consider double-precision multiplication.
The product has four components, shown schematically in Fig.\
4. Since we need only the leftmost eight bytes, it is convenient
to work only with the digits to the left of the vertical line
in the diagram, and this means in particular that we need not
even compute the product of the two least significant halves.
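The same three-partial-product scheme can be sketched in C for binary fractions of 64 bits split
into 32-bit halves, dropping the product of the two low halves exactly as in Fig.\ 4; the
function name and the choice of word size are illustrative only, and (like Program M) the result
may come out slightly too small because the discarded digits are simply ignored.

#include <stdint.h>

/* Approximate high 64 bits of the product of two 64-bit fractions
   u = um*2^32 + ul and v = vm*2^32 + vl.  The term ul*vl is dropped,
   as in Fig. 4, and the fractional parts of the two cross products
   are truncated, so the result errs only on the low side, by at most
   two units in the lowest position.  Sketch only. */
uint64_t mul_hi_approx(uint64_t u, uint64_t v)
{
    uint64_t um = u >> 32, ul = u & 0xffffffffu;
    uint64_t vm = v >> 32, vl = v & 0xffffffffu;
    return um * vm                     /* u_m x v_m, full weight      */
         + ((um * vl) >> 32)           /* u_m x v_l, keep high half   */
         + ((ul * vm) >> 32);          /* u_l x v_m, keep high half   */
}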
\algbegin Program M (Double-precision
multiplication). The input and output conventions for
this subroutine are the same as for Program A.
{\yyskip\tabskip10pt\mixfour{\!
01⊗BYTE⊗EQU⊗1(4:4)⊗Byte size\cr
02⊗QQ⊗EQU⊗BYTE*BYTE/2⊗Excess of double-precision exponent\cr
03⊗DFMUL⊗STJ⊗EXITDF⊗Double-precision multiplication:\cr
04⊗⊗STA⊗TEMP\cr
05⊗⊗SLAX⊗2⊗Remove exponent.\cr
06⊗⊗STA⊗ARG⊗$v↓m$\cr
07⊗⊗STX⊗ARGX⊗$v↓{\,l}$\cr
\\08⊗⊗LDA⊗TEMP(EXPD)\cr
09⊗⊗ADD⊗ACC(EXPD)\cr
10⊗⊗STA⊗EXPO⊗$\.{EXPO} ← e↓u + e↓v.$\cr
11⊗⊗ENT2⊗-QQ⊗$\rI2 ← -\.{QQ}$.\cr
\\12⊗⊗LDA⊗ACC\cr
13⊗⊗LDX⊗ACCX\cr
14⊗⊗SLAX⊗2⊗Remove exponent.\cr
15⊗⊗STA⊗ACC⊗$u↓m$\cr
16⊗⊗STX⊗ACCX⊗$u↓{\,l}$\cr
\\17⊗⊗MUL⊗ARGX⊗$u↓m \times v↓{\,l}$\cr
18⊗⊗STA⊗TEMP\cr
19⊗⊗LDA⊗ARG(ABS)\cr
20⊗⊗MUL⊗ACCX(ABS)⊗$|v↓m \times u↓{\,l}|$\cr
\\21⊗⊗SRA⊗1⊗$0\;x\;x\;x\;x$\cr
\\22⊗⊗ADD⊗TEMP(1:4)⊗(Overflow cannot occur)\cr
23⊗⊗STA⊗TEMP\cr
\\24⊗⊗LDA⊗ARG\cr
25⊗⊗MUL⊗ACC⊗$v↓m \times u↓m$\cr
26⊗⊗STA⊗TEMP(SIGN)⊗True sign of result\cr
\\27⊗⊗STA⊗ACC⊗Now prepare to add all the\cr
28⊗⊗STX⊗ACCX⊗\qquad partial products together.\cr
\\29⊗⊗LDA⊗ACCX(0:4)⊗$0\;x\;x\;x\;x$\cr
30⊗⊗ADD⊗TEMP⊗(Overflow cannot occur)\cr
31⊗⊗SRAX⊗4\cr
32⊗⊗ADD⊗ACC⊗(Overflow cannot occur)\cr
33⊗⊗JMP⊗DNORM⊗Normalize and exit.\quad\blackslug\cr}}
\topinsert{\hbox to size{\hfill\vbox{
\baselineskip0pt \lineskip0pt
\def\\{\vrule height 9.5pt depth 3pt}
\def\¬{\hbox to .4pt{\raise 9.5pt\null \lower 3pt\null}}
\halign{\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\hskip 10pt
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{#\hfill}⊗\hskip 10pt
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\hskip 10pt
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
\hbox to 10pt{$\hfill#\hfill$}⊗\!
$\null#\hfill$\cr
⊗⊗⊗⊗⊗⊗⊗⊗⊗\¬⊗u⊗u⊗u⊗u⊗u⊗u⊗u⊗u⊗0⊗0⊗=u↓m+εu↓{\,l}\cr
⊗⊗⊗⊗⊗⊗⊗⊗⊗\¬⊗v⊗v⊗v⊗v⊗v⊗v⊗v⊗v⊗0⊗0⊗=v↓m+εv↓{\,l}\cr
\noalign{\moveright122.5pt\vbox{\hrule width 105pt}}
⊗⊗⊗⊗⊗⊗⊗⊗⊗\\⊗x⊗x⊗x⊗x⊗x⊗x⊗0⊗0⊗0⊗0⊗=ε↑2u↓{\,l}\times v↓{\,l}\cr
⊗⊗⊗⊗⊗x⊗x⊗x⊗x⊗\\\hfill$x$⊗x⊗x⊗x⊗0⊗0⊗⊗⊗⊗⊗⊗=εu↓m\times v↓{\,l}\cr
⊗⊗⊗⊗⊗x⊗x⊗x⊗x⊗\\\hfill$x$⊗x⊗x⊗x⊗0⊗0⊗⊗⊗⊗⊗⊗=εu↓{\,l}\times v↓m\cr
x⊗x⊗x⊗x⊗x⊗x⊗x⊗x⊗x⊗\\\hfill$x$⊗⊗⊗⊗⊗⊗⊗⊗⊗⊗⊗=u↓m\times v↓m\cr
\noalign{\moveright2.5pt\vbox{\hrule width 225pt}}
w⊗w⊗w⊗w⊗w⊗w⊗w⊗w⊗w⊗\\\hfill$w$⊗w⊗w⊗w⊗w⊗w⊗w⊗0⊗0⊗0⊗0\cr}}\hfill}
\yyskip
\ctrline{\caption Fig.\ 4. Double-precision
multiplication of eight-byte fraction parts.}}
\yyskip Note the careful treatment of signs in
this program, and note also the fact that the range of exponents
makes it impossible to compute the final exponent using an index
register. Program M is perhaps too slipshod in accuracy,
since it throws away all the information to the right of the
vertical line in Fig.\ 4; this can make the least significant
byte as much as 2 in error. A little more accuracy can be achieved
as discussed in exercise 4.
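\yskip The scheme of Fig.\ 4, and the size of the error just mentioned, are easy to
model in a few lines of a present-day programming language. The following rough
sketch is of course not Program M itself; the byte size $b=64$ and the names used
are merely illustrative.

    # Rough model of Fig. 4 (not Program M): u and v are integers in
    # [0, b**8) standing for eight-byte fractions u/b**8 and v/b**8; the
    # exact top eight bytes of their product would be (u*v) // b**8.
    def dmul_approx(u, v, b=64):
        B = b**4                                  # one half = four bytes
        um, ul = divmod(u, B)                     # u = um*B + ul
        vm, vl = divmod(v, B)                     # v = vm*B + vl
        return um*vm + (um*vl)//B + (ul*vm)//B    # the u_l*v_l product is dropped

    import random
    b, worst = 64, 0
    for _ in range(10000):
        u, v = random.randrange(b**7, b**8), random.randrange(b**7, b**8)
        worst = max(worst, (u*v)//b**8 - dmul_approx(u, v, b))
    print('largest shortfall in the last byte:', worst)   # always 0, 1, or 2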
\yskip Double-precision floating division is the
most difficult routine, or at least the most frightening prospect
we have encountered so far in this chapter. Actually, it is
not terribly complicated, once we see how to do it; let us write
the numbers to be divided in the form $(u↓m + εu↓{\,l})/(v↓m + εv↓{\,l})$,
where $ε$ is the reciprocal of the word size of the computer,
and where $v↓m$ is assumed to be normalized. The fraction can
now be expanded as follows:
$$\baselineskip36pt
\eqalignno{u↓m + εu↓{\,l}\over v↓m + εv↓{\,l} ⊗= {u↓m + εu↓{\,l}\over
v↓m} \left(1\over 1 + ε(v↓{\,l}/v↓m)\right)\cr
⊗= {u↓m + εu↓{\,l}\over v↓m}\left( 1
- ε\left(v↓{\,l}\over v↓m\right) + ε↑2\left(v↓{\,l}\over v↓m\right)↑2 - \cdotss
\right).⊗(2)\cr}$$
Since $0 ≤ |v↓{\,l}| < 1$ and $1/b ≤ |v↓m| < 1$, we have
$|v↓{\,l}/v↓m| < b$, and the error from dropping terms involving $ε↑2$
can be disregarded. Our method therefore is to compute $w↓m
+ εw↓{\,l} = (u↓m + εu↓{\,l})/v↓m$, and then to subtract $ε$ times $w↓mv↓{\,l}/v↓m$
from the result.
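\yskip Before looking at the program, the reader may find it reassuring to try this
scheme with exact rational arithmetic; the following rough sketch (in a present-day
language, not Program D itself, with $ε=1/1000$ and three-digit halves chosen
arbitrarily) shows that the error of dropping the $ε↑2$ terms of (2) is indeed
negligible:

    # Sketch of the division scheme of the text: divide by v_m alone, then
    # subtract eps * w_m * v_l / v_m.  Exact rationals are used throughout,
    # so the only error left is that of the dropped eps**2 terms in (2).
    from fractions import Fraction as F

    def ddiv_approx(um, ul, vm, vl, eps):
        w = (um + eps*ul) / vm             # w_m + eps*w_l = (u_m + eps*u_l)/v_m
        wm = (w // eps) * eps              # leading half w_m of the quotient
        return w - eps * wm * vl / vm      # the correction term of the text

    eps = F(1, 1000)
    um, ul = F(271, 1000), F(828, 1000)    # u = .271828
    vm, vl = F(314, 1000), F(159, 1000)    # v = .314159
    exact = (um + eps*ul) / (vm + eps*vl)
    print(float(ddiv_approx(um, ul, vm, vl, eps) - exact))   # of order eps**2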
In the following program, lines 27--32 do the lower
half of a double-precision addition, using another method for
forcing the appropriate sign as an alternative to the trick
of Program A.
\algbegin Program D (Double-precision division).
This program adheres to the same conventions as Programs A and
M.
{\yyskip\tabskip 3pt\mixfour{\!
01⊗DFDIV⊗STJ⊗EXITDF⊗Double-precision division:\cr
02⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
03⊗⊗STA⊗TEMP\cr
04⊗⊗SLAX⊗2⊗Remove exponent.\cr
05⊗⊗STA⊗ARG⊗$v↓m$\cr
06⊗⊗STX⊗ARGX⊗$v↓{\,l}$\cr
\\07⊗⊗LDA⊗ACC(EXPD)\cr
08⊗⊗SUB⊗TEMP(EXPD)\cr
09⊗⊗STA⊗EXPO⊗$\.{EXPO}←e↓u-e↓v$.\cr
10⊗⊗ENT2⊗QQ+1⊗$\rI2←\.{QQ}+1$.\cr
\\11⊗⊗LDA⊗ACC\cr
12⊗⊗LDX⊗ACCX\cr
13⊗⊗SLAX⊗2⊗Remove exponent.\cr
\\14⊗⊗SRAX⊗1⊗(Cf.\ Algorithm 4.2.1M)\cr
15⊗⊗DIV⊗ARG⊗If overflow, it is detected below.\cr
16⊗⊗STA⊗ACC⊗$w↓m$\cr
\\17⊗⊗SLAX⊗5⊗Use remainder in further division.\cr
18⊗⊗DIV⊗ARG\cr
19⊗⊗STA⊗ACCX⊗$\pm w↓{\,l}$\cr
\\20⊗⊗LDA⊗ARGX(1:4)\cr
21⊗⊗ENTX⊗0\cr
22⊗⊗DIV⊗ARG(ABS)⊗$\rA←\lfloor|b↑4v↓{\,l}/v↓m|\rfloor/b↑5$.\cr
\\23⊗⊗JOV⊗DVZROD⊗Did division cause overflow?\cr
24⊗⊗MUL⊗ACC(ABS)⊗$\rAX←|w↓mv↓{\,l}/bv↓m|$, approximately.\cr
\\25⊗⊗SRAX⊗4⊗Multiply by $b$, and save\cr
26⊗⊗SLC⊗5⊗\qquad the leading byte in rX.\cr
27⊗⊗SUB⊗ACCX(ABS)⊗Subtract $|w↓{\,l}|$.\cr
\\28⊗⊗DECA⊗1⊗Force minus sign.\cr
29⊗⊗SUB⊗WM1\cr
\\30⊗⊗JOV⊗*+2⊗If no overflow, carry one more\cr
31⊗⊗INCX⊗1⊗\qquad to upper half.\cr
\\32⊗⊗SLC⊗5⊗(Now $\rA≤0$)\cr
33⊗⊗ADD⊗ACC(ABS)⊗$\rA←|w↓m|-|\rA|$.\cr
34⊗⊗STA⊗ACC(ABS)⊗(Now $\rA≥0$)\cr
\\35⊗⊗LDA⊗ACC⊗$w↓m$ with correct sign\cr
36⊗⊗JMP⊗DNORM⊗Normalize and exit.\cr
\\37⊗DVZROD⊗HLT⊗30⊗Unnormalized or zero divisor\cr
\\38⊗1H⊗EQU⊗1(1:1)\cr
39⊗WM1⊗CON⊗1B-1,BYTE-1(1:1)⊗Word size minus one\quad\blackslug\cr}}
\yyskip Here is a table of the approximate average computation times for these
double-precision subroutines, compared to the single-precision
subroutines that appear in Section 4.2.1:
$$\vbox{\halign{#\hfill⊗\qquad\hfill#⊗#\hfill⊗\qquad\hfill#⊗#\hfill\cr
⊗Single p⊗recision⊗Double p⊗recision\cr
\noalign{\vskip 3pt}
Addition⊗45⊗.5$u$⊗84⊗$u$\cr
Subtraction⊗49⊗.5$u$⊗88⊗$u$\cr
Multiplication⊗48⊗$u$⊗109⊗$u$\cr
Division⊗52⊗$u$⊗126⊗.5$u$\cr}}$$
For extension of the methods of this section to triple-precision floating-point
fraction parts, see Y. Ikebe, {\sl CACM \bf8} (1965), 175--177.
\exbegin{EXERCISES}
\exno 1. [16] Try the double-precision division technique by hand, with $ε = {1\over
1000}$, when dividing 180000 by 314159.\xskip $\biglp$Thus, let $(u↓m,
u↓{\,l}) = (.180, .000)$ and $(v↓m, v↓{\,l}) = (.314, .159)$, and find
the quotient using the method suggested in the text following
(2).$\bigrp$
\exno 2. [20] Would it
be a good idea to insert the instruction ``\.{ENTX 0}'' between
lines 30 and 31 of Program M\null, in order to keep unwanted information
left over in register X from interfering with the accuracy of
the results?
\exno 3. [M20] Explain
why overflow cannot occur during Program M.
\exno 4. [22] How should Program M be changed so that extra
accuracy is achieved, essentially by moving the vertical line
in Fig.\ 4 over to the right one position? Specify all changes
that are required, and determine the difference in execution
time caused by these changes.
\trexno 5. [24] How should Program A be changed so that extra
accuracy is achieved, essentially by working with a nine-byte
accumulator instead of an eight-byte accumulator to the right
of the decimal point? Specify all changes that are required,
and determine the difference in execution time caused by these
changes.
\exno 6. [23] Assume that
the double-precision subroutines of this section and the single-precision
subroutines of Section 4.2.1 are being used in the same main
program. Write a subroutine that converts a single-precision
floating-point number into double-precision form (1), and write
another subroutine that converts a double-precision floating-point
number into single-precision form (reporting exponent
overflow or underflow if the conversion is impossible).
\trexno 7. [M30] Estimate
the accuracy of the double-precision subroutines in this section,
by finding bounds $\delta ↓1$, $\delta ↓2$, and $\delta ↓3$ on
the relative errors
$$\def\\{\hbox{\:u\char'152}}
\vbox{\baselineskip15pt\halign{\hfill$#$\hfill\cr
\\\biglp (u \oplus v) - (u + v)\bigrp/(u + v)\\,\qquad
\\\biglp (u \otimes v) - (u \times v)\bigrp /(u \times v)\\,\cr
\\\biglp (u \odiv v) - (u/v)\bigrp /(u/v)\\.\cr}}$$
\exno 8. [M28] Estimate the accuracy
of the ``improved'' double-precision subroutines of exercises
4 and 5, in the sense of exercise 7.
\exno 9. [M42] T. J. Dekker [{\sl Numer.\ Math.\ \bf18} (1971), 224--242]
has suggested an alternative approach to double precision, based entirely
on single-precision floating-binary calculations. For example, Theorem
4.2.2C states that $u+v=w+r$, where $w=u\oplus v$ and $r=(u\ominus w)\oplus v$,
if $|u|≥|v|$ and the radix is 2; here $|r|≤|w|/2↑p$, so the pair $(w,r)$
may be considered a double-precision version of $u+v$. To add two such
pairs $(u,u↑\prime)\oplus(v,v↑\prime)$, where $|u↑\prime|≤|u|/2↑p$ and
$|v↑\prime|≤|v|/2↑p$ and $|u|≥|v|$, Dekker suggests computing
$u+v=w+r$ (exactly), then $s=(r\oplus v↑\prime)\oplus u↑\prime$ (an
approximate remainder), and finally returning the value $\biglp w\oplus s,\,
(w\ominus(w\oplus s))\oplus s\bigrp$.
Study the accuracy and efficiency of this approach when it is used recursively
to produce quadruple-precision calculations.
%folio 328 galley 17 (C) Addison-Wesley 1978 *
\runningrighthead{DISTRIBUTION OF FLOATING-POINT NUMBERS}
\section{4.2.4}
\sectionskip
\sectionbegin{4.2.4. Distribution of Floating-Point Numbers}
In order to analyze the average behavior of floating-point arithmetic
algorithms (and in particular to determine their average running time),
we need some statistical information that allows us to determine how
often various cases arise. The purpose of this section is to discuss the
empirical and theoretical properties of the distribution of floating-point
numbers.
\subsectionbegin{A. Addition and subtraction routines}
The execution time for a floating-point addition or subtraction depends largely
on the initial difference of exponents, and also on the number of
normalization steps required (to the left or to the right). No way is known
to give a good theoretical model that tells what characteristics to expect,
but extensive empirical investigations have been made by D. W. Sweeney
[{\sl IBM Systems J. \bf4} (1965), 31--42].
By means of a special tracing routine, Sweeney ran six ``typical'' large-scale
numerical programs, selected from several different computing laboratories,
and examined each floating addition or subtraction operation very carefully. Over
250,000 floating-point addition-subtractions were involved in gathering this
data. About one out of every ten instructions executed by the tested programs
was either \.{FADD} or \.{FSUB}.
Let us consider subtraction to be addition preceded by negating the second
operand; therefore we may give all the statistics as if we were merely doing
addition. Sweeney's results can be summarized as follows:
One of the two operands to be added was found to be equal to zero about 9 percent
of the time, and this was usually the accumulator (\.{ACC}). The other 91 percent
of the cases split about equally between operands of the same or of opposite signs,
and about equally between cases where $|u|≤|v|$ or $|v|≤|u|$. The computed answer
was zero about 1.4 percent of the time.
The difference between exponents had a behavior approximately
given by the probabilities in the following table, for various
radices $b$:
$$\vbox{\ninepoint\baselineskip0pt\lineskip0pt
\def\\{\vrule height 8.75pt depth 2.25pt}
\def\¬{\vrule height 2pt}
\halign{\quad\hfill#\hfill\quad⊗#⊗\quad\hfill#\hfill
⊗\qquad\hfill#\hfill⊗\qquad\hfill#\hfill⊗\qquad\hfill#\hfill\cr
$|e↓u-e↓v|$⊗\\⊗$b=2$⊗$b=10$⊗$b=16$⊗$b=64$\cr
⊗\¬\cr
\noalign{\hrule}
⊗\¬\cr
0⊗\\⊗0.33⊗0.47⊗0.47⊗0.56\cr
1⊗\\⊗0.12⊗0.23⊗0.26⊗0.27\cr
2⊗\\⊗0.09⊗0.11⊗0.10⊗0.04\cr
3⊗\\⊗0.07⊗0.03⊗0.02⊗0.02\cr
4⊗\\⊗0.07⊗0.01⊗0.01⊗0.02\cr
5⊗\\⊗0.04⊗0.01⊗0.02⊗0.00\cr
over 5⊗\\⊗0.28⊗0.13⊗0.11⊗0.09\cr
⊗\¬\cr
\noalign{\hrule}
⊗\¬\cr
average⊗\\⊗3.1⊗0.9⊗0.8⊗0.5\cr}}$$
(The ``over 5'' line includes essentially all
of the cases when one operand is zero, but the ``average'' line does
not include these cases.)
When $u$ and $v$ have the same sign and are normalized,
then $u + v$ either requires one shift to the {\sl right} (for
fraction overflow), or no normalization shifts whatever. When
$u$ and $v$ have opposite signs, we have zero or more {\sl left} shifts
during the normalization. The following table gives the observed
number of shifts required:
$$\vbox{\ninepoint\baselineskip0pt\lineskip0pt
\def\\{\vrule height 8.75pt depth 2.25pt}
\def\¬{\vrule height 2pt}
\halign{\quad#\hfill\quad⊗#⊗\quad\hfill#\hfill
⊗\qquad\hfill#\hfill⊗\qquad\hfill#\hfill⊗\qquad\hfill#\hfill\cr
⊗\\⊗$b=2$⊗$b=10$⊗$b=16$⊗$b=64$\cr
⊗\¬\cr
\noalign{\hrule}
⊗\¬\cr
Shift right 1⊗\\⊗0.20⊗0.07⊗0.06⊗0.03\cr
No shift⊗\\⊗0.59⊗0.80⊗0.82⊗0.87\cr
Shift left 1⊗\\⊗0.07⊗0.08⊗0.07⊗0.06\cr
Shift left 2⊗\\⊗0.03⊗0.02⊗0.01⊗0.01\cr
Shift left 3⊗\\⊗0.02⊗0.00⊗0.01⊗0.00\cr
Shift left 4⊗\\⊗0.02⊗0.01⊗0.00⊗0.01\cr
Shift left >4⊗\\⊗0.06⊗0.02⊗0.02⊗0.02\cr}}$$
The last line includes all cases where the
result was zero. The average number of left shifts per normalization
was about 0.9 when $b = 2$; about 0.2 when $b = 10$ or 16; and about 0.1
when $b = 64$.
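\yskip Statistics of this kind are not difficult to gather on present-day machines.
The following rough sketch (radix 2, using the host computer's ordinary
floating-point arithmetic; it is in no sense a reconstruction of Sweeney's tracing
routine, and the names are merely illustrative) tallies the exponent differences and
the normalization shifts for whatever stream of operand pairs it is given:

    import math

    def add_stats(pairs):
        # Tally |e_u - e_v| and the normalization shifts for u (+) v, radix 2;
        # zero operands are skipped, as in the discussion above.
        ediff, shifts = {}, {}
        for u, v in pairs:
            if u == 0 or v == 0:
                continue
            eu, ev = math.frexp(u)[1], math.frexp(v)[1]    # binary exponents
            ediff[abs(eu - ev)] = ediff.get(abs(eu - ev), 0) + 1
            s = u + v
            key = ('zero result' if s == 0
                   else max(eu, ev) - math.frexp(s)[1])    # -1: right shift; k>0: k left shifts
            shifts[key] = shifts.get(key, 0) + 1
        return ediff, shifts

    print(add_stats([(1.5, 2.25), (3.0, -2.875), (0.375, -0.375)]))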
\subsectionbegin{B. The fraction parts} Further analysis
of floating-point routines can be based on the {\sl statistical
distribution of the fraction parts} of randomly chosen normalized
floating-point numbers. In this case the facts are quite surprising,
and there is an interesting theory that accounts for the unusual
phenomena that are observed.
For convenience let us temporarily assume that
we are dealing with floating-{\sl decimal} (i.e., radix 10)
arithmetic; modifications of the following discussion to any
other positive integer base $b$ will be very straightforward.
Suppose we are given a ``random'' positive normalized number
$(e, f) = 10↑e \cdot f$. Since $f$ is normalized, we know that
its leading digit is 1, 2, 3, 4, 5, 6, 7, 8, or 9, and it seems
natural to assume that each of these nine possible leading digits will occur
about one-ninth of the time. But, in fact, the behavior in practice
is quite different. For example, the leading digit tends to
be equal to 1 over 30 percent of the time!
One way to test the assertion just made is to take
a table of physical constants (e.g., the speed of light, the
acceleration of gravity) from some standard reference. If we look at
the {\sl Handbook of Mathematical Functions} (U.S. Dept of Commerce,
1964), for example, we find that 8 of the 28 different physical
constants given in Table 2.3, roughly 29 percent, have leading
digit equal to 1. The decimal values of $n!$ for $1 ≤ n ≤ 100$
include exactly 30 entries beginning with 1; so do the decimal
values of $2↑n$ and of $F↓n$, for $1 ≤ n ≤ 100$. We might also
try looking at census reports, or a Farmer's Almanack (but
not a telephone directory).
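\yskip (The tallies quoted above are easily checked by machine nowadays; for
instance, the following few lines in a present-day language count the entries of
$2↑n$, $n!$, and $F↓n$ for $1≤n≤100$ whose decimal values begin with 1.)

    from math import factorial

    def leads_with_1(x):                  # does the decimal value of x begin with 1?
        return str(x)[0] == '1'

    fib = [0, 1]                          # fib[n] = F_n, so that F_1 = F_2 = 1
    while len(fib) <= 100:
        fib.append(fib[-1] + fib[-2])

    print(sum(leads_with_1(2**n) for n in range(1, 101)),            # powers of two
          sum(leads_with_1(factorial(n)) for n in range(1, 101)),    # factorials
          sum(leads_with_1(fib[n]) for n in range(1, 101)))          # Fibonacci numbers
    # each of the three counts should come out to 30, as stated in the text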
In the days before pocket calculators, the pages
in well-used tables of log\-a\-rithms tended to get quite dirty
in the front, while the last pages stayed relatively clean and
neat. This phenomenon was apparently first mentioned in print by the
American astronomer Simon Newcomb [{\sl Amer.\ J. Math. \bf 4}
(1881), 39--40], who gave good grounds for believing that the
leading digit $d$ occurs with probability $\log↓{10}(1 + 1/d)$.
The same distribution was discovered empirically, many years
later, by Frank Benford, who reported the results of 20,229
observations taken from different sources [{\sl Proc.\ Amer.\
Philosophical Soc.\ \bf 78} (1938), 551--572].
In order to account for this leading-digit law,
let's take a closer look at the way we write numbers in floating-point
notation. If we take any positive number $u$, its leading digits
are determined by the value $(\log↓{10} u)\mod 1$: The leading
digit is less than $d$ if and only if
$$(\log↓{10} u)\mod 1 < \log↓{10} d,\eqno (1)$$
since $10\,f↓u = 10↑{(\log↓{10} u)\mod 1}$.
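\yskip For example, if $u=2↑{35}=34359738368$ we have $\log↓{10}u\approx10.536$, so
that $(\log↓{10}u)\mod 1\approx0.536$; this lies between $\log↓{10}3\approx0.477$
and $\log↓{10}4\approx0.602$, and indeed the leading digit of $u$ is 3.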
Now if we have any ``random'' positive number $U$,
chosen from some reasonable distribution such as might occur
in nature, we might expect that $(\log↓{10} U)\mod 1$ would be
uniformly distributed between zero and one, at least to a very
good approximation.\xskip (Similarly, we expect $U \mod 1$, $U↑2 \mod
1$, $\sqrt{U + π} \mod 1$, etc., to be uniformly distributed.
We expect a roulette wheel to be unbiased, for essentially the
same reason.)\xskip Therefore by (1) the leading digit will be 1 with
probability $\log↓{10} 2 \approx 30.103$ percent; it will be 2
with probability $\log↓{10} 3 - \log↓{10} 2 \approx 17.609$ percent;
and, in general, if $r$ is any real value between 1 and 10, we
ought to have $10\,f↓U ≤ r$ approximately $\log↓{10} r$ of the
time.
Another way to explain this law is to say that
a random value $U$ should appear at a random point on a slide
rule, according to the uniform distribution, since the distance
from the left end of a slide rule to the position of $U$ is
proportional to $(\log↓{10} U)\mod 1$. The analogy between slide
rules and floating-point calculation is very close when multiplication
and division are being considered.
The fact that leading digits tend to be small is
important to keep in mind; it makes the most obvious techniques
of ``average error'' estimation for floating-point calculations
invalid. The relative error due to rounding is usually a little
more than expected.
Of course, it may justly be said that the heuristic
argument above does not prove the stated law. It merely shows
us a plausible reason why the leading digits behave the way
they do. An interesting approach to the analysis of leading
digits has been suggested by R. Hamming: Let $p(r)$ be the probability
that $10\,f↓U ≤ r$, where $1 ≤ r ≤ 10$ and $f↓U$ is the normalized
fraction part of a random normalized floating-point number $U$.
If we think of random quantities in the real world, we observe
that they are measured in terms of arbitrary units; and if we
were to change the definition of a meter or a gram, many of
the fundamental physical constants would have different values.
Suppose then that all of the numbers in the universe are suddenly
multiplied by a constant factor $c$; our universe of random
floating-point quantities should be essentially unchanged by
this transformation, so $p(r)$ should not be affected.
%folio 330 galley 18 (C) Addison-Wesley 1978 *
Multiplying everything by $c$ has the effect of transforming $(\log↓{10} U)\mod
1$ into $(\log↓{10} U +\log↓{10} c)\mod 1$. It is now time to set
up formulas that describe the desired behavior; we may
assume that $1 ≤ c ≤ 10$. By definition,
$$p(r) =\hbox{probability that }(\log↓{10} U)\mod 1 ≤ \log↓{10}r.$$
By our assumption, we should also have
$$\eqalignno{p(r)⊗ =\hbox{probability that }(\log↓{10} U + \log↓{10}
c)\mod 1 ≤ \log↓{10} r\cr
\noalign{\vskip6pt}
⊗=\left\{\lpile{\hbox{probability that $(\log↓{10}U\mod1)≤\log↓{10}r
-\log↓{10}c$}\cr
\noalign{\vskip 3pt}
\hbox to 45pt{\hfill or}
\hbox{ $(\log↓{10}U\mod1)≥1-\log↓{10}c$,\qquad if $c≤r$;}\cr
\noalign{\vskip 6pt}
\hbox{probability that $(\log↓{10}U\mod1)≤\log↓{10}r+1-\log↓{10}c$}\cr
\noalign{\vskip 3pt}
\hbox to 45pt{\hfill and}
\hbox{ $(\log↓{10}U\mod1)≥1-\log↓{10}c$,\qquad if $c≥r$;}\cr}\right.\cr
\noalign{\vskip6pt}
⊗=\left\{\vcenter{\halign{$#\hfill$\qquad⊗if $#$\hfill\cr
p(r/c)+1-p(10/c),⊗c≤r;\cr
\noalign{\vskip2pt}
p(10r/c)-p(10/c),⊗c≥r.\cr}}\right.⊗(2)\cr}$$
Let us now extend the function $p(r)$ to values outside
the range $1 ≤ r ≤ 10$, by defining $p(10↑nr) = p(r) + n$; then
if we replace $10/c$ by $d$, the last equation of (2) may be
written
$$p(rd) = p(r) + p(d).\eqno (3)$$
If our assumption about invariance of
the distribution under multiplication by a constant factor is
valid, then Eq.\ (3) must hold for all $r > 0$ and $1 ≤ d ≤ 10$.
The facts that $p(1) = 0$, $p(10) = 1$ now imply that
$$\def\\{(\spose{\raise7pt\hbox{\hskip2pt$\scriptstyle n$}}\sqrt{10}\,)}
1 = p(10) = p\biglp \\↑n\bigrp = p\\
+ p\biglp \\↑{n-1}\bigrp =\cdots = np\\;$$
hence we deduce that $p(10↑{m/n}) = m/n$ for all
positive integers $m$ and $n$. If we now decide to require that
$p$ is continuous, we are forced to conclude that $p(r) = \log↓{10}
r$, and this is the desired law.
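\yskip (It is easy to verify that $p(r)=\log↓{10}r$ does satisfy both alternatives
of (2): when $c≤r$ we have $\log↓{10}(r/c)+1-\log↓{10}(10/c)=\log↓{10}r$, and when
$c≥r$ we have $\log↓{10}(10r/c)-\log↓{10}(10/c)=\log↓{10}r$.)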
Although this argument may be more convincing
than the first one, it doesn't really hold up under scrutiny
if we stick to conventional notions of probability. The traditional
way to make the above argument rigorous is to assume that there
is some underlying distribution of numbers $F(u)$ such that
a given positive number $U$ is $≤u$ with probability $F(u)$;
then the probability of concern to us is
$$\chop to 9pt{p(r) =\sum ↓{m}\,\biglp F(10↑mr) - F(10↑m)\bigrp},\eqno (4)$$
summed over all values $-∞ < m < ∞$. Our assumptions
about scale invariance and continuity have led us to conclude that
$$p(r) = \log↓{10} r.$$
Using the same argument, we could ``prove'' that
$$\chop to 9pt{\sum↓{m}\,\biglp F(b↑mr) - F(b↑m)\bigrp = \log↓b r},\eqno (5)$$
for each integer $b ≥ 2$, when $1 ≤ r
≤ b$. But there {\sl is} no distribution function $F$ that satisfies
this equation for all such $b$ and $r$! (See exercise 7.)
One way out of the difficulty is to regard the
logarithm law $p(r) = \log↓{10} r$ as only a very close {\sl
approximation} to the true distribution. The true distribution
itself may perhaps be changing as the universe expands, becoming
a better and better approximation as time goes on; and if we
replace 10 by an arbitrary base $b$, the approximation might
be less accurate (at any given time) as $b$ gets larger.
Another rather appealing way to resolve the dilemma, by abandoning
the traditional idea of a distribution function, has been suggested
by R. A. Raimi, {\sl AMM \bf 76} (1969), 342--348.
The hedging in the last paragraph is probably a
very unsatisfactory explanation, and so the following further
calculation (which sticks to rigorous mathematics and avoids
any intuitive, yet paradoxical, notions of probability) should
be welcome. Let us consider the distribution of the leading
digits of the {\sl positive integers}, instead of the distribution
for some imagined set of real numbers. The investigation of
this topic is quite interesting, not only because it sheds some
light on the probability distributions of floating-point data,
but also because it makes a particularly instructive example
of how to combine the methods of discrete mathematics with the
methods of infinitesimal calculus.
In the following discussion, let $r$ be a fixed
real number, $1 ≤ r ≤ 10$; we will attempt to make a reasonable
definition of $p(r)$, the ``probability'' that the representation
$10↑{e↓N}\cdot f↓N$ of a ``random'' positive integer $N$
has $10\,f↓N < r$, assuming infinite precision.
To start, let us try to find the probability using
a limiting method like the definition of ``Pr'' in Section 3.5.
One nice way to rephrase that definition is to define
$$P↓0(n)=\left\{\lpile{1,\qquad\hbox{if $n=10↑e\cdot f$ where $10\,f<r$,}\cr
\qquad\qquad\qquad\hbox{i.e., if $(\log↓{10}n)\mod1<\log↓{10}r$;}\cr
0,\qquad\hbox{otherwise.}\cr}\right.\eqno(6)$$
Now $P↓0(1)$, $P↓0(2)$, $\ldots$ is an infinite sequence of
zeros and ones, with ones to represent the cases that contribute
to the probability we are seeking. We can try to ``average out'' this sequence,
by defining
$$P↓1(n) = {1\over n}\sum ↓{1≤k≤n} P↓0(k).\eqno (7)$$
Thus if we generate a random integer between 1
and $n$ using the techniques of Chapter 3, and convert it to
floating-decimal form $(e, f)$, the probability that $10\,f <
r$ is exactly $P↓1(n)$. It is natural to let $\lim↓{n→∞} P↓1(n)$
be the ``probability'' $p(r)$ we are after, and that is just
what we did in Section 3.5.
%folio 333 galley 19 (C) Addison-Wesley 1978 *
But in this case the limit does not exist: For
example, let us consider the subsequence
$$P↓1(s),\, P↓1(10s),\, P↓1(100s),\, \ldotss ,\, P↓1(10↑ns),\, \ldotss,$$
where $s$ is a real number, $1≤s≤10$. If $s≤r$, we find that
$$\eqalignno{P↓1(10↑ns)⊗={1\over 10↑n\,s}\biglp\lceil r\rceil - 1 + \lceil
10r\rceil - 10 + \cdots + \lceil 10↑{n-1}r\rceil\cr
⊗\hskip90pt\null - 10↑{n-1}+ \lfloor 10↑ns\rfloor + 1 - 10↑n\bigrp\cr
\noalign{\vskip4pt}
⊗={1\over 10↑n\,s} \biglp r(1 + 10 + \cdots +
10↑{n-1}) + O(n)\cr
⊗\hskip90pt\null + \lfloor 10↑ns\rfloor - 1 - 10 -\cdots - 10↑n\bigrp\cr
\noalign{\vskip4pt}
⊗={1\over 10↑n\,s}\textstyle \biglp {1\over 9}(10↑nr
- 10↑{n+1}) + \lfloor 10↑ns\rfloor + O(n)\bigrp .⊗(8)\cr}$$
As $n → ∞$, $P↓1(10↑ns)$ therefore approaches the limiting value
$1 + (r - 10)/9s$. The above calculation for the case $s ≤ r$
can be modified so that it is valid for $s > r$ if we replace
$\lfloor 10↑ns\rfloor + 1$ by $\lceil 10↑nr\rceil $; when
$s ≥ r$, we therefore obtain the limiting value $10(r - 1)/9s$.\xskip [See
J. Franel, {\sl Naturforschende Gesellschaft, Vierteljahrsschrift
\bf 62} (Z\"urich, 1917), 286--295.]
In other words, the sequence $\langle P↓1(n)\rangle$
has subsequences $\langle P↓1(10↑ns)\rangle$ whose
limit goes from $(r - 1)/9$ up to $10(r - 1)/9r$ and down again
to $(r - 1)/9$, as $s$ goes from 1 to $r$ to 10. We see that
$P↓1(n)$ has no limit as $n→∞$; and the values of $P↓1(n)$ for large $n$
are not particularly good
approximations to our conjectured limit $\log↓{10} r$ either!
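\yskip This oscillation is easy to watch numerically; a rough sketch in a
present-day language, with $r=2$ so that $P↓0(k)$ simply asks whether $k$ begins
with the digit 1, might run as follows:

    def P0(k):                      # Eq. (6) with r = 2: does k begin with a 1?
        return str(k)[0] == '1'

    def P1(n):                      # Eq. (7): average of P0(1), ..., P0(n)
        return sum(P0(k) for k in range(1, n + 1)) / n

    for n in (10**5, 2 * 10**5, 10**6, 2 * 10**6):
        print(n, round(P1(n), 4))
    # the values stay near 1/9 = .1111 when n = 10**j and near 5/9 = .5556
    # when n = 2*10**j, instead of settling down to log10(2) = .30103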
Since $P↓1(n)$ doesn't approach a limit, we can
try to use the same idea as (7) once again, to ``average out''
the anomalous behavior. In general, let
$$\chop to 9pt{P↓{m+1}(n) = {1\over n}\sum ↓{1≤k≤n} P↓m(k)}.\eqno (9)$$
Then $P↓{m+1}(n)$ will tend to be a more
well-behaved sequence than $P↓m(n)$. Let us try to confirm this
with quantitative calculations; our experience
with the special case $m = 0$ indicates that it might be worth
while to consider the subsequence $P↓{m+1}(10↑ns)$. The
following results can, in fact, be derived:
\thbegin Lemma Q. {\sl For any integer $m ≥ 1$ and any real
number $ε > 0$, there are functions $Q↓m(s)$, $R↓m(s)$ and an integer
$N↓m(ε)$, such that whenever $n > N↓m(ε)$ and $1 ≤ s ≤ 10$, we have
$$\cpile{|P↓m(10↑ns) - Q↓m(s)| < ε,\qquad\hbox{if $s ≤ r$;}\cr\noalign{\vskip3pt}
|P↓m(10↑ns) - \biglp Q↓m(s) + R↓m(s)\bigrp | <
ε,\qquad\hbox{if $s > r$.}\cr}\eqno(10)$$
Furthermore the functions $Q↓m(s)$ and $R↓m(s)$ satisfy
the relations}
$$\eqalignno{Q↓m(s)⊗ = {1\over s}\bigglp{1\over 9} \int ↑{10}↓{1}Q↓{m-1}(t)
\,dt + \int ↑{s}↓{1}Q↓{m-1}(t)\,dt + {1\over 9} \int ↑{10}↓{r}
R↓{m-1}(t)\, dt\biggrp;\cr
\noalign{\vskip4pt}
R↓m(s) ⊗= {1\over s} \int ↑{s}↓{r}R↓{m-1}(t)\,dt;⊗(11)\cr
\noalign{\vskip4pt}
Q↓0(s) ⊗= 1,\qquad R↓0(s) = -1.\cr}$$
\def\\{\raise17.5pt\null\lower12.5pt\null} % formula of 30points in size
\dproofbegin Consider the functions $Q↓m(s)$ and $R↓m(s)$
defined by (11), and let
$$S↓m(t)=\left\{\vcenter{\halign{$#\hfill\qquad$⊗$#\hfill$\cr
Q↓m(t),⊗t ≤ r.\cr\noalign{\vskip3pt}
Q↓m(t) + R↓m(t),⊗t > r.\cr}}\right.\eqno(12)$$
We will prove the lemma by induction on $m$.
First note that $Q↓1(s) = \biglp
1 + (s - 1) + (r - 10)/9\bigrp /s = 1 + (r - 10)/9s$, and $R↓1(s)
= (r - s)/s$. From (8) we find that
$|P↓1(10↑ns) - S↓1(s)| = O(n)/10↑n$;
this establishes the lemma when $m = 1$.
Now for $m > 1$, we have $$\\\chop to 9pt{\hbox to size{$\dispstyle P↓m(10↑ns) =
{1\over s}\bigglp\sum ↓{0≤j<n}\≤{1\over 10↑{n-j}}
\≤\sum ↓{10↑j≤k<10↑{j+1}}\≤ {1\over 10↑j} P↓{m-1}(k) +\≤\≤\≤ \sum ↓{10↑n≤k≤10↑ns}
\≤{1\over 10↑n}P↓{m-1}(k)\biggrp,$}}$$
and we want to approximate this quantity. By induction,
the difference
$$\\\left|\chop to 9pt{ \sum ↓{10↑j≤k≤10↑jq} {1\over 10↑j} P↓{m-1}(k) - \≤\≤\sum
↓{10↑j≤k≤10↑jq} {1\over 10↑j} S↓{m-1}\bigglp{k\over 10↑j}\biggrp}\right|
\eqno (13)$$
is less than $10ε$ whenever $j>N↓{m-1}(ε)$, by the induction hypothesis.
% [Much of the galley is unreadable here.  Lost: the conclusion of the proof
% of Lemma Q\null; the statement of Theorem F (that $Q↓m(s)$ approaches
% $\log↓{10}r$ uniformly for $1≤s≤10$ as $m→∞$, cf.\ the end of its proof
% below); and the first part of that proof, in which $Q↓m(s)$ is written as
% $1+\biglp c↓m+c↓{m-1}\ln s+\cdots+c↓1(\ln s)↑{m-1}/(m-1)!\bigrp/s$ (cf.\
% Eq.\ 22) and the generating function $C(z)=c↓1z+c↓2z↑2+\cdots$ is
% introduced.]
Hence $c↓{m+1}$, which can be shown from (11) to equal
$$\eqalignno{⊗{1\over 10}\bigglp c↓{m+1}+c↓m\ln 10+\cdots
+c↓1\,{(\ln 10)↑m\over m!}\biggrp\cr
\noalign{\vskip2pt}
⊗\hskip90pt\null+ {r\over 10}\bigglp 1 +\cdots
+ {1\over m!}\bigglp\ln {10\over r}\biggrp↑m\,\biggrp- 1,\cr}$$
is the coefficient of $z↑{m+1}$ in the function
$${1\over 10}C(z)10↑z + {r\over 10}\bigglp{10\over r}\biggrp↑z\bigglp{z\over
1 - z}\biggrp - {z\over 1 - z} .\eqno (24)$$
This condition holds for all values of
$m$, so (24) must equal $C(z)$, and we obtain the explicit formula
$$C(z) = {-z\over 1 - z}\,\bigglp{(10/r)↑{z-1} - 1\over 10↑{z-1}
- 1}\biggrp.\eqno (25)$$
We want to study asymptotic properties of
the coefficients of $C(z)$, to complete our analysis. The large
parenthesized quantity in (25) approaches $\ln (10/r)/\ln 10
= 1 - \log↓{10} r$ as $z → 1$, so we see that
$$C(z) + {1 - \log↓{10} r\over 1 - z} = R(z)\eqno (26)$$
is an analytic function of the complex variable
$z$ in the circle
$$|z| < \left| 1 + {2πi\over \ln 10}\right| .$$
In particular, $R(z)$ converges for $z = 1$, so
its coefficients approach zero. This proves that the coefficients
of $C(z)$ behave like those of $(\log↓{10} r - 1)/(1 - z)$, that
is,
$$\lim↓{m→∞} c↓m =\log↓{10} r - 1.$$
Finally, we may combine this with (22), to
show that $Q↓m(s)$ approaches
$$1 + {\log↓{10} r - 1\over s}\bigglp1 + \ln s + {1\over 2!}
(\ln s)↑2 +\cdotss\biggrp \;=\; \log↓{10} r$$
uniformly for $1 ≤ s ≤ 10$.\quad\blackslug
%folio 338 galley 21 (C) Addison-Wesley 1978 *
\yyskip Therefore we have established the logarithmic law
for integers by direct calculation, at the same time seeing
that it is an extremely good approximation to the average behavior
although it is never precisely achieved.
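\yskip The convergence of the $c↓m$ is in fact quite rapid, as the reader can verify
by iterating their recurrence numerically; a rough sketch in a present-day language,
with $r=2$ chosen arbitrarily, might look like this:

    import math

    def c_coeffs(r, M):
        # iterate the recurrence for c_1, ..., c_M given before (24) and
        # watch the values approach log10(r) - 1
        ln10, lnq = math.log(10.0), math.log(10.0 / r)
        c = [0.0]                                  # c[m] will hold c_m; c[0] is unused
        for m in range(M):                         # determine c_{m+1}
            s1 = sum(c[j] * ln10**(m + 1 - j) / math.factorial(m + 1 - j)
                     for j in range(1, m + 1))
            s2 = sum(lnq**k / math.factorial(k) for k in range(m + 1))
            c.append((s1/10 + r*s2/10 - 1) / (1 - 1/10))
        return c[1:]

    print([round(x, 5) for x in c_coeffs(2.0, 6)], round(math.log10(2.0) - 1, 5))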
\yskip The above proofs of Lemma Q and Theorem F are slight
simplifications and amplifications of methods due to B. J. Flehinger,
{\sl AMM \bf 73} (1966), 1056--1061. Many authors have written
about the distribution of initial digits, showing that the logarithmic
law is a good approximation for many underlying distributions;
see the survey by Ralph A. Raimi, {\sl AMM \bf 83} (1976), 521--538,
for a comprehensive review of the literature. Another interesting (and different)
treatment of floating-point distribution has been given by
Alan G. Konheim, {\sl Math.\ Comp.\ \bf 19} (1965), 143--144.
Exercise 17 discusses an approach to
the definition of probability under which the logarithmic law
holds exactly, over the integers. Furthermore, exercise 18 demonstrates that
{\sl any} reasonable definition of probability over the integers must lead to
the logarithmic law, if it assigns a value to the probability of leading digits.
\exbegin{EXERCISES}
\exno 1. [13] Given that $u$
and $v$ are nonzero floating-decimal numbers {\sl with the same
sign}, what is the approximate probability that fraction overflow
occurs during the calculation of $u \oplus v$, according to
Sweeney's tables?
\exno 2. [42] Make further tests of floating-point addition
and subtraction, to confirm or improve on the accuracy of Sweeney's
tables.
\exno 3. [15] What is the probability that the two leading digits
of a floating-decimal number are ``23'', according to the logarithmic
law?
\exno 4. [M18] The text points out
that the front pages of a well-used table of logarithms get
dirtier than the back pages do. What if we had an {\sl antilogarithm}
table instead, i.e., a table giving the value of $x$ when $\log↓{10}
x$ is given; which pages of such a table would be the dirtiest?
\trexno 5. [M20] Suppose that $U$ is a real number uniformly distributed
in the interval $0 < U < 1$. What is the distribution of the
leading digits of $U?$
\exno 6. [23] If we have binary computer
words containing $n + 1$ bits, we might use $p$ bits for the
fraction part of floating-binary numbers, one bit for the sign,
and $n - p$ bits for the exponent. This means that the range
of values representable, i.e., the ratio of the largest positive normalized value
to the smallest, is essentially $2↑{2↑{n-p}}$.
The same computer word could be used to represent floating-{\sl
hexadecimal} numbers, i.e., floating-point numbers with radix
16, with $p + 2$ bits for the fraction part $\biglp (p + 2)/4$
hexadecimal digits$\bigrp$ and $n - p - 2$ bits for the exponent;
then the range of values would be $16↑{2↑{n-p-2}}=2↑{2↑{n-p}}$,
the same as before, and with more bits
in the fraction part. This may sound as if we are getting something
for nothing, but the normalization condition for base 16 is weaker
in that there may be up to three leading zero bits in the fraction
part; thus not all of the $p + 2$ bits are ``significant.''
On the basis of the logarithmic law, what are
the probabilities that the fraction part of a positive normalized
radix 16 floating-point number has exactly 0, 1, 2, and 3 leading
zero bits? Discuss the desirability of hexadecimal versus binary.
\exno 7. [HM28] Prove that
there is no distribution function $F(u)$ that satisfies (5)
for each integer $b ≥ 2$, and for all real values $r$ in the
range $1 ≤ r ≤ b$.
\exno 8. [HM23] Does (10) hold when $m = 0$ for suitable $N↓0(ε)$?
\exno 9. [HM24] (P. Diaconis.)\xskip Let $P↓1(n)$, $P↓2(n)$, $\ldots$ be
any sequence of functions defined by repeatedly averaging a given function
$P↓0(n)$ according to Eq.\ (9).
Prove that $\lim↓{m→∞} P↓m(n) = P↓0(1)$ for all fixed $n$.
\trexno 10. [HM28] The text shows that $c↓m = \log↓{10} r - 1
+ ε↓m$, where $ε↓m$ approaches zero as $m→∞$. Obtain the next
term in the asymptotic expansion of $c↓m$.
\exno 11. [M15] Given that $U$ is a
random variable distributed according to the logarithmic
law, prove that $1/U$ is also.
\exno 12. [HM25] (R. W. Hamming.)\xskip The purpose of this exercise
is to show that the re\-sult of floating-point multiplication
tends to obey the logarithmic law more perfectly than the operands
do. Let $U$ and $V$ be random, normalized, positive floating-point
numbers, whose fraction parts are independently distributed
with the respective density functions $f(x)$ and $g(x)$. Thus,
$f↓u ≤ r$ and $f↓v ≤ s$ with probability $\int ↑{r}↓{1/b}\int
↑{s}↓{1/b}f(x)g(y)\,dy\,dx$, for $1/b ≤ r, s ≤ 1$. Let $h(x)$
be the density function of the fraction part of $U \times V$
(unrounded). Define the {\sl abnormality} $A(f)$ of a density
function $f$ to be the maximum relative error,
$$A(f) = \max↓{1/b≤x≤1} \left|f(x) - l(x)\over l(x)\right|,$$
where $l(x) = 1/(x \ln b)$ is the density of the
logarithmic distribution.
Prove that $A(h) ≤ \min\biglp A(f),
A(g)\bigrp$.\xskip$\biglp$In particular, if either factor has logarithmic
distribution the product does also.$\bigrp$
\trexno 13. [M20] The floating-point multiplication routine, Algorithm
4.2.1M\null, requires zero or one left shifts during normalization,
depending on whether $f↓uf↓v ≥ 1/b$ or not. Assuming that the
input operands are independently distributed according to the
logarithmic law, what is the probability that no left shift
is needed for normalization of the result?
\trexno 14. [HM30] Let $U$ and $V$ be random, normalized, positive
floating-point numbers whose fraction parts are independently
distributed according to the logarithmic law, and let $p↓k$
be the probability that the difference in their exponents is
$k$. Assuming that the distribution of the exponents is independent
of the fraction parts, give an equation for the probability
that ``fraction overflow'' occurs during the floating-point
addition of $U \oplus V$, in terms of the base $b$ and the quantities
$p↓0$, $p↓1$, $p↓2$, $\ldotss\,$. Compare this result with exercise
1.\xskip (Ignore rounding.)
\exno 15. [HM28] Let $U$, $V$, $p↓0$, $p↓1$, $\ldots$ be as in exercise
14, and assume that radix 10 arithmetic is being used. Show
that regardless of the values of $p↓0$, $p↓1$, $p↓2$, $\ldotss $, the
sum $U \oplus V$ will {\sl not} obey the logarithmic law exactly,
and in fact the probability that $U \oplus V$ has leading digit
1 is always strictly {\sl less} than $\log↓{10} 2$.
\exno 16. [HM28] (P. Diaconis.)\xskip Let $P↓0(n)$ be 0 or 1 for each
$n$, and define ``probabilities'' $P↓{m+1}(n)$ by repeated averaging,
as in (9). Show that if $\lim↓{n→∞} P↓1(n)$ does not exist, neither
does $\lim↓{n→∞} P↓m(n)$ for any $m$.\xskip [{\sl Hint:} Prove
that the conditions $(a↓1 +\cdots + a↓n)→0$ and $a↓{n+1} ≤ a↓n +
M/n$, for some fixed constant $M > 0$, imply that $a↓n → 0$.]
\trexno 17. [HM25] (R. L. Duncan.)\xskip Another way to define $\Pr\biglp
S(n)\bigrp$ is to evaluate the quantity
$\lim↓{n→∞} \biglp (\sum ↓{S(k)\hbox{\:e\ and }
1≤k≤n}\,1/k)/H↓n\bigrp $, since it can be shown that this
``harmonic probability'' exists and equals $\Pr\biglp S(n)\bigrp$
whenever the latter exists according to Definition 3.5A.
Prove that the harmonic probability of the statement ``$(\log↓{10}n)\mod1<r$''
exists and equals $r$.\xskip$\biglp$Thus, initial digits of integers {\sl exactly}
satisfy the logarithmic law in this sense.$\bigrp$
\trexno 18. [HM30] Let $P(S)$ be any real-valued function defined on sets $S$
of positive integers, but not necessarily on all such sets, satisfying the
following rather weak axioms:
\yskip\hang\textindent{i)}If $P(S)$ and $P(T)$ are defined and $S∩T=\emptyset$
then $P(S∪T)=P(S)+P(T)$.
\hang\textindent{ii)}If $P(S)$ is defined then $P(S+1)=P(S)$, where $S+1=\leftset
n+1\relv n\in S\rightset$.
\hang\textindent{iii)}If $P(S)$ is defined then $P(2S)={1\over2}P(S)$, where
$2S=\leftset2n\relv n\in S\rightset$.
\hang\textindent{iv)}If $S$ is the set of {\sl all} positive integers then $P(S)=1$.
\hang\textindent{v)}If $P(S)$ is defined then $P(S)≥0$.
\yskip\noindent Assume furthermore that $P(L↓a)$ is defined for all positive
integers $a$, where $L↓a$ is the set of all integers whose decimal representation
begins with $a$:
$$L↓a=\leftset n\;\left|\;10↑m a ≤ n < 10↑m(a+1) \hbox{ for some integer }m
\rightset\right..$$
(In this definition, $m$ may be negative; for example, 1 is an element of
$L↓{10}$, but not of $L↓{11}$.)
Prove that $P(L↓a)=\log↓{10}(1+1/a)$ for all integers $a≥1$.
\vfill\eject